Re: swap data in Kudu table

Boris Sat, 04 Aug 2018 13:28:48 -0700

Thanks so much Tomas, glad you liked it. But as you might have already seen
in another thread, the workaround I described won't work with Impala 2.12
due to a breaking change.


On Thu, Aug 2, 2018, 07:18 far...@tf-bic.sk <far...@tf-bic.sk> wrote:

> Thanks Boris for a great article!
> Tomas
>
> On 2018/07/25 19:56:10, Boris Tyukin <bo...@boristyukin.com> wrote:
> > Hi guys,
> >
> > thanks again for your help! I just blogged about this:
> > https://boristyukin.com/how-to-hot-swap-apache-kudu-tables-with-apache-impala/
> >
> > BTW I did not have to invalidate or refresh metadata - it just worked with
> > the ALTER TABLE TBLPROPERTIES idea. We have one Kudu master on our dev
> > cluster, so I am not sure if it is because of that, but the Impala/Kudu
> > docs also do not mention anything about metadata refresh. It looks like
> > Impala is keeping a reference to the UUID of the Kudu table, not its
> > actual name.
> >
> > One thing I am still puzzled about is how Impala was able to finish my
> > long-running SELECT statement that I had kicked off right before the swap.
> > I did not get any error messages, and I could clearly see that Kudu tables
> > were getting renamed and dropped while the query was still running in a
> > different session; it completed 10 seconds after the swap. This is still a
> > mystery to me. The only explanation I have is that the data was already in
> > the Impala daemons' memory and the query did not need the Kudu tables at
> > that point.
> >
> > Boris
> >
> >
> >
> > On Fri, Feb 23, 2018 at 5:13 PM Boris Tyukin <bo...@boristyukin.com> wrote:
> >
> > > you guys are awesome, thanks!
> > >
> > > Todd, I like the ALTER TABLE TBLPROPERTIES idea - will test it next week.
> > > Views might work as well, but for a number of reasons I want to keep
> > > them as my last resort :)
> > >
> > > On Fri, Feb 23, 2018 at 4:32 PM, Todd Lipcon <t...@cloudera.com> wrote:
> > >
> > >> A couple other ideas from the Impala side:
> > >>
> > >> - could you use a view and alter the view to point to a different table?
> > >> Then all readers would be pointed at the view, and security permissions
> > >> could be on that view rather than the underlying tables?
> > >>
> > >> - I think if you use an external table in Impala you could use an ALTER
> > >> TABLE TBLPROPERTIES ... statement to change kudu.table_name to point to
> > >> a different table. Then issue a 'refresh' on the impalads so that they
> > >> load the new metadata. Subsequent queries would hit the new underlying
> > >> Kudu table, but permissions and stats would be unchanged.
> > >>
> > >> -Todd
> > >>
> > >> On Fri, Feb 23, 2018 at 1:16 PM, Mike Percy <mpe...@apache.org> wrote:
> > >>
> > >>> Hi Boris, those are good ideas. Currently Kudu does not have atomic bulk
> > >>> load capabilities or staging abilities. Theoretically, renaming a
> > >>> partition atomically shouldn't be that hard to implement, since it's
> > >>> just a master metadata operation which can be done atomically, but it's
> > >>> not yet implemented.
> > >>>
> > >>> There is a JIRA to track a generic bulk load API here:
> > >>> https://issues.apache.org/jira/browse/KUDU-1370
> > >>>
> > >>> Since I couldn't find anything to track the specific features you
> > >>> mentioned, I just filed the following improvement JIRAs so we can track
> > >>> them:
> > >>>
> > >>>    - KUDU-2326: Support atomic bulk load operation
> > >>>    <https://issues.apache.org/jira/browse/KUDU-2326>
> > >>>    - KUDU-2327: Support atomic swap of tables or partitions
> > >>>    <https://issues.apache.org/jira/browse/KUDU-2327>
> > >>>
> > >>> Mike
> > >>>
> > >>> On Thu, Feb 22, 2018 at 6:39 AM, Boris Tyukin <bo...@boristyukin.com> wrote:
> > >>>
> > >>>> Hello,
> > >>>>
> > >>>> I am trying to figure out the best and safest way to swap data in a
> > >>>> production Kudu table with data from a staging table.
> > >>>>
> > >>>> Basically, once in a while we need to perform a full reload of some
> > >>>> tables (once every few months). These tables are pretty large, with
> > >>>> billions of rows, and we want to minimize the risk and downtime for
> > >>>> users if something bad happens in the middle of that process.
> > >>>>
> > >>>> With Hive and Impala on HDFS, we can use a very handy command, LOAD
> > >>>> DATA INPATH. We can prepare data for the reload in a staging table
> > >>>> upfront, and this process might take many hours. Once the staging table
> > >>>> is ready, we can issue a LOAD DATA INPATH command, which will move the
> > >>>> underlying HDFS files to a production table - this operation is almost
> > >>>> instant and is the very last step in our pipeline.
> > >>>>
> > >>>> Alternatively, we can swap partitions using the ALTER TABLE EXCHANGE
> > >>>> PARTITION command.
> > >>>>
> > >>>> Now with Kudu, I cannot seem to find a good strategy. The only thing
> > >>>> that came to my mind is to drop the production table and rename a
> > >>>> staging table to the production table as the last step of the job, but
> > >>>> in this case we are going to lose statistics and security permissions.
> > >>>>
> > >>>> Any other ideas?
> > >>>>
> > >>>> Thanks!
> > >>>> Boris
> > >>>>
> > >>>
> > >>>
> > >>
> > >>
> > >> --
> > >> Todd Lipcon
> > >> Software Engineer, Cloudera
> > >>
> > >
> > >
> >
>
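
For readers skimming the thread, here is a minimal sketch of the two
Impala-side ideas Todd suggests above: repointing an external table through
kudu.table_name, or repointing a view. It only builds the statement strings
and does not connect to Impala. The function names and the table names in the
usage lines are hypothetical. The REFRESH step is optional because Boris
reports it was not needed on his dev cluster, and per his note at the top of
this message the TBLPROPERTIES workaround as he described it does not work
with Impala 2.12, so treat this as an illustration rather than a recipe.

```python
from typing import List


def quote_literal(value: str) -> str:
    """Escape a value for use inside a single-quoted SQL string literal."""
    return value.replace("\\", "\\\\").replace("'", "\\'")


def repoint_external_table(
    impala_table: str, new_kudu_table: str, refresh: bool = True
) -> List[str]:
    """Statements that point an external Impala table at another Kudu table.

    Assumes impala_table is an EXTERNAL table mapped onto Kudu, as in Todd's
    suggestion. Identifiers are inserted as-is, so pass trusted names only.
    """
    statements = [
        f"ALTER TABLE {impala_table} "
        f"SET TBLPROPERTIES ('kudu.table_name' = '{quote_literal(new_kudu_table)}');"
    ]
    if refresh:
        # Todd suggests a refresh; Boris found it unnecessary in his test.
        statements.append(f"REFRESH {impala_table};")
    return statements


def repoint_view(view_name: str, new_impala_table: str) -> List[str]:
    """Statement that redefines a view to read from a different table."""
    return [f"ALTER VIEW {view_name} AS SELECT * FROM {new_impala_table};"]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for stmt in repoint_external_table("prod_db.sales", "sales_reload_v2"):
        print(stmt)
    for stmt in repoint_view("prod_db.sales_view", "prod_db.sales_reload_v2"):
        print(stmt)
```

In a reload job, these statements would be issued as the very last step,
through whatever Impala client the pipeline already uses, once the staging
table has been fully loaded.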
