Thanks so much, Tomas, glad you liked it. But as you might have seen in another thread already, the workaround I described won't work with Impala 2.12 due to a breaking change.
On Thu, Aug 2, 2018, 07:18 far...@tf-bic.sk <far...@tf-bic.sk> wrote:

> Thanks Boris for a great article!
> Tomas
>
> On 2018/07/25 19:56:10, Boris Tyukin <bo...@boristyukin.com> wrote:
> > Hi guys,
> >
> > thanks again for your help! I just blogged about this
> > https://boristyukin.com/how-to-hot-swap-apache-kudu-tables-with-apache-impala/
> >
> > BTW I did not have to invalidate or refresh metadata - it just worked with
> > ALTER TABLE TBLPROPERTIES idea. We have one Kudu master on our dev cluster
> > so not sure if it is because of that but Impala/Kudu docs also do not
> > mention anything about metadata refresh. Looks like Impala is keeping a
> > reference to uuid of the Kudu table not its actual name.
> >
> > One thing I am still puzzled is how Impala was able to finish my
> > long-running SELECT statement, that I had kicked off right before the swap.
> > I did not get any error messages and I could clearly see that Kudu tables
> > were getting renamed and dropped, while the query was still running in a
> > different session and completed 10 seconds after the swap. This is still a
> > mystery to me. The only explanation I have is that data was already in
> > Impala daemons memory and did not need Kudu tables at that point.
> >
> > Boris
> >
> > On Fri, Feb 23, 2018 at 5:13 PM Boris Tyukin <bo...@boristyukin.com> wrote:
> >
> > > you are guys are awesome, thanks!
> > >
> > > Todd, I like ALTER TABLE TBLPROPERTIES idea - will test it next week.
> > > Views might work as well but for a number of reasons want to keep it as my
> > > last resort :)
> > >
> > > On Fri, Feb 23, 2018 at 4:32 PM, Todd Lipcon <t...@cloudera.com> wrote:
> > >
> > >> A couple other ideas from the Impala side:
> > >>
> > >> - could you use a view and alter the view to point to a different table?
> > >> Then all readers would be pointed at the view, and security permissions
> > >> could be on that view rather than the underlying tables?
> > >>
> > >> - I think if you use an external table in Impala you could use an ALTER
> > >> TABLE TBLPROPERTIES ... statement to change kudu.table_name to point to a
> > >> different table. Then issue a 'refresh' on the impalads so that they load
> > >> the new metadata. Subsequent queries would hit the new underlying Kudu
> > >> table, but permissions and stats would be unchanged.
> > >>
> > >> -Todd
> > >>
> > >> On Fri, Feb 23, 2018 at 1:16 PM, Mike Percy <mpe...@apache.org> wrote:
> > >>
> > >>> Hi Boris, those are good ideas. Currently Kudu does not have atomic bulk
> > >>> load capabilities or staging abilities. Theoretically renaming a partition
> > >>> atomically shouldn't be that hard to implement, since it's just a master
> > >>> metadata operation which can be done atomically, but it's not yet
> > >>> implemented.
> > >>>
> > >>> There is a JIRA to track a generic bulk load API here:
> > >>> https://issues.apache.org/jira/browse/KUDU-1370
> > >>>
> > >>> Since I couldn't find anything to track the specific features you
> > >>> mentioned, I just filed the following improvement JIRAs so we can track it:
> > >>>
> > >>> - KUDU-2326: Support atomic bulk load operation
> > >>> <https://issues.apache.org/jira/browse/KUDU-2326>
> > >>> - KUDU-2327: Support atomic swap of tables or partitions
> > >>> <https://issues.apache.org/jira/browse/KUDU-2327>
> > >>>
> > >>> Mike
> > >>>
> > >>> On Thu, Feb 22, 2018 at 6:39 AM, Boris Tyukin <bo...@boristyukin.com>
> > >>> wrote:
> > >>>
> > >>>> Hello,
> > >>>>
> > >>>> I am trying to figure out the best and safest way to swap data in a
> > >>>> production Kudu table with data from a staging table.
> > >>>>
> > >>>> Basically, once in a while we need to perform a full reload of some
> > >>>> tables (once in a few months). These tables are pretty large with billions
> > >>>> of rows and we want to minimize the risk and downtime for users if
> > >>>> something bad happens in the middle of that process.
> > >>>>
> > >>>> With Hive and Impala on HDFS, we can use a very cool handy command LOAD
> > >>>> DATA INPATH. We can prepare data for reload in a staging table upfront and
> > >>>> this process might take many hours. Once staging table is ready, we can
> > >>>> issue LOAD DATA INPATH command which will move underlying HDFS files to a
> > >>>> production table - this operation is almost instant and the very last step
> > >>>> in our pipeline.
> > >>>>
> > >>>> Alternatively, we can swap partitions using ALTER TABLE EXCHANGE
> > >>>> PARTITION command.
> > >>>>
> > >>>> Now with Kudu, I cannot seem to find a good strategy. The only thing
> > >>>> came to my mind is to drop the production table and rename a staging table
> > >>>> to production table as the last step of the job, but in this case we are
> > >>>> going to lose statistics and security permissions.
> > >>>>
> > >>>> Any other ideas?
> > >>>>
> > >>>> Thanks!
> > >>>> Boris
> > >>>>
> > >>
> > >> --
> > >> Todd Lipcon
> > >> Software Engineer, Cloudera
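
For readers following the thread, the external-table swap Todd suggests above could look roughly like the sketch below. It is only an illustration under stated assumptions: prod_tbl, prod_tbl_v1 and prod_tbl_v2 are hypothetical names, and prod_tbl is assumed to be an external Impala table mapped to a Kudu table. As noted at the top of this message, the workaround is reported not to work with Impala 2.12.

```sql
-- Hypothetical names, not taken from the thread:
--   prod_tbl     external Impala table that users query
--   prod_tbl_v1  Kudu table currently behind prod_tbl
--   prod_tbl_v2  freshly loaded Kudu table that should replace it

-- Step 1 (not shown): create and load prod_tbl_v2 in Kudu; this may take hours.

-- Step 2: repoint the external Impala table at the new Kudu table.
-- Permissions and stats stay attached to prod_tbl, as Todd describes.
ALTER TABLE prod_tbl SET TBLPROPERTIES ('kudu.table_name' = 'prod_tbl_v2');

-- Step 3: reload metadata so the impalads pick up the change.
-- Boris reports this was not needed on his single-master dev cluster.
REFRESH prod_tbl;

-- Step 4 (not shown): drop prod_tbl_v1 in Kudu once nothing reads from it.
```

The point of the design is that users only ever see prod_tbl, so the swap is a metadata change rather than a drop and rename of the table they query.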