My suggestion does require a change to your ETL process, but it doesn't
require you to copy the data into HDFS or to create storage clusters.  Hive
managed tables can reside in S3 with no problem.
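
As a rough sketch of what I mean (the table and column names here are made
up, and this assumes hive.metastore.warehouse.dir points at an S3 path, which
may not match your setup):

```sql
-- Sketch only: table and column names are hypothetical, and this assumes
-- hive.metastore.warehouse.dir is set to an S3 path so that managed tables
-- are stored in S3 rather than HDFS.
CREATE TABLE events_managed (id BIGINT, payload STRING)
STORED AS PARQUET
TBLPROPERTIES ('transactional'='true',
               'transactional_properties'='insert_only');

-- The ETL change: load from the existing external table into the managed one.
INSERT INTO TABLE events_managed
SELECT id, payload FROM events_external;
```

The data written by the INSERT stays in S3; only the ETL step that populates
the managed table is new.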

Alan.

On Thu, Apr 25, 2019 at 2:18 PM Thai Bui <blquyt...@gmail.com> wrote:

> Your suggested workflow would work, but it would require us to re-ETL data
> from S3 to multiple clusters. This is a cumbersome approach since most of
> our data resides on S3 and our clusters are somewhat transient in nature
> (they are redeployed every few months and don't have large HDFS capacity).
>
> We do scale clusters up and down for compute but not for storage, since HDFS
> is not easy to scale down on demand. In this architecture it would be much
> preferable to have Hive behave as a pure compute engine that can
> be accelerated through query result caching and materialized views.
>
> I'm not familiar enough with the Hive 3 implementation to know whether this
> feature would be simple to build. I was hoping to change only the front-end
> of Hive and keep the ACID back-end implementation intact. For example, we
> could reuse transactional_properties and add 'read_only' as a new value.
> With read-only tables, all INSERT, UPDATE, and DELETE statements would fail
> at the Hive front-end. This ensures that the ACID properties are guaranteed
> and that the rest of the ACID assumptions on the back-end continue to hold.
> DDL operations have to go through the metastore, so I think they would
> automatically work with the current ACID code base; the only thing we would
> need to do is enable them (where they were disabled) and test them.
>
> On Wed, Apr 24, 2019 at 6:05 PM Alan Gates <alanfga...@gmail.com> wrote:
>
> > Would a workflow like the following work then:
> > 1. Non-Hive tool produces data
> > 2. Do a Hive load into a managed table.  This effectively takes a
> > snapshot of the data.
> > 3. Now you still have the data for Non-Hive tools to operate on, and in
> > Hive you get all the Hive 3 goodness.
> >
> > This would introduce an additional copy of the data.  It would be
> > interesting to look at adding a copy on write semantic to a partition to
> > avoid this copy, but you don't need that to get going.
> >
> > I'm not opposed to what you're suggesting; I'm just wondering if there
> > are other ways that will save you work and keep Hive simpler.
> >
> > Alan.
> >
> > On Wed, Apr 24, 2019 at 2:07 PM Thai Bui <blquyt...@gmail.com> wrote:
> >
> > > As I understand it, insert-only ACID tables only work if your table is a
> > > managed table (so you'll have to create your table with CREATE TABLE
> > > .. TBLPROPERTIES ('transactional_properties'='insert_only') ) and Hive
> > > will control the data layout.
> > >
> > > Unfortunately, in my case, I'm concerned with external tables where
> > > data is written to cloud storage such as S3 by other tools such as
> > > Spark, PySpark, Sqoop, or older Hive clusters and Hadoop-based systems.
> > > My wish is to have materialized views and query result caching work
> > > directly on that data if and only if the table is registered as an
> > > external, read-only table in Hive 3 via the same ACID mechanism.
> > >
> > > On Wed, Apr 24, 2019 at 3:35 PM Alan Gates <alanfga...@gmail.com> wrote:
> > >
> > > > Have you looked at the insert-only ACID tables in Hive 3 (
> > > > https://issues.apache.org/jira/browse/HIVE-14535 )?  These were
> > > > designed specifically with the cloud in mind, since the way Hive
> > > > traditionally adds new data doesn't work well in the cloud.  And they
> > > > do not require ORC; they work with any file format.
> > > >
> > > > Alan.
> > > >
> > > > On Wed, Apr 24, 2019 at 12:04 PM Thai Bui <blquyt...@gmail.com> wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > Hive 3 has brought significant changes to the community with the
> > > > > support for ACID tables as the default managed tables. With ACID
> > > > > tables, we can use features such as materialized views, query
> > > > > result caching for BI tools, and more. But for non-ACID tables, such
> > > > > as external tables, Hive doesn't support any of these advanced
> > > > > features, which makes a majority of cloud-native users like me
> > > > > sad :(.
> > > > >
> > > > > I propose that we support a more limited, read-only form of
> > > > > external tables so that materialized views and query result caching
> > > > > would work. For example:
> > > > >
> > > > > CREATE EXTERNAL TABLE table_name (..) STORED AS ORC
> > > > > LOCATION 's3://some-bucket/some-dir'
> > > > > TBLPROPERTIES ('read-only'='true');
> > > > >
> > > > > In such tables, any data modification operation such as INSERT or
> > > > > UPDATE would fail, while DDL operations that add or remove
> > > > > partitions, such as "ALTER TABLE ... ADD PARTITION", would succeed.
> > > > > This would make it possible for Hive to invalidate the cache and
> > > > > materialized views even when the table is an external table.
> > > > >
> > > > > Let me know what you think, and maybe I can start writing a wiki
> > > > > document describing the approach in greater detail.
> > > > >
> > > > > Thanks,
> > > > > Thai
> > > > >
> > > >
> > >
> > >
> > > --
> > > Thai
> > >
> >
>
>
> --
> Thai
>
