(moved conversation from Gitter)

> Tilak Patidar @tilakpatidar 02:21
> Hi all, I have a use case I wanted a general idea on how to solve using
> Gobblin. We are getting data from our client in the form of CSV dumps in
> S3 buckets, periodically. These dumps could be deltas or full dumps; we
> don't know which it will be. We need to write this data into a Hive table,
> so while writing we might have to check for changes in a row based on its
> primary key, and only update Hive if the data has changed for that primary
> key. How can this be solved using Gobblin? I looked into Hive merge but was
> wondering how I could use it with Gobblin.
Hi Tilak,

What kind of scale are you looking at? And do you have managed Hive tables or
external ones? If I recall correctly, updates can only be applied to managed
Hive ORC tables. I also doubt that lookup-and-update would work well at high
volume. If your volume is low and the Hive table is managed, then you can look
into an S3 source, a converter that does the lookup, and a JDBC writer.

For high volume, however, your use case looks similar to our database ingest at
LinkedIn. We ingest snapshots as well as increments, and apply the increments
on the snapshots. We materialize deltas into snapshots only infrequently;
instead, we use specialized readers that read from a snapshot with the deltas
applied at read time. The delta materialization into snapshots is done by a
legacy system, which is on its way to being replaced with Gobblin.

Abhishek
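To make the "deltas applied on snapshots at read time" idea concrete, here is a
minimal illustrative sketch in Python. It is not Gobblin or Hive API code; the
record layout, the `id` primary key, and the `apply_deltas` helper are all
assumptions for illustration. It assumes deltas are ordered oldest-to-newest, so
a later delta for the same key wins.

```python
def apply_deltas(snapshot, deltas, key="id"):
    """Return the merged view of `snapshot` with `deltas` applied.

    `snapshot` and `deltas` are iterables of dict records. A delta record
    whose key already exists in the snapshot replaces that row (an update);
    a delta with a new key is an insert. Result order follows the snapshot,
    with inserted rows appended in delta order.
    """
    # Index the snapshot by primary key (dicts preserve insertion order).
    merged = {row[key]: row for row in snapshot}
    for row in deltas:
        merged[row[key]] = row  # upsert: update if present, insert otherwise
    return list(merged.values())
```

A full materialization would persist the merged result as the new snapshot; a
read-time reader would instead perform this merge on the fly while scanning,
which is why the snapshot only needs to be rewritten infrequently.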
