+ user@gobblin

On Fri, Mar 23, 2018 at 11:19 AM, Abhishek Tiwari <[email protected]> wrote:
> (moved conversation from Gitter)
>
>> Tilak Patidar @tilakpatidar 02:21
>> Hi all, I have a use case I wanted a general idea on how to solve
>> using Gobblin. We receive data from our client as CSV dumps in S3
>> buckets periodically. These dumps could be deltas or full dumps; we don't
>> know which it will be. We need to write this data into a Hive table, so
>> while writing we may have to check for changes in a row based on a primary
>> key and only update Hive if the data has changed for that key. How can
>> this be solved using Gobblin? I looked into Hive merge but was wondering
>> how I could use it with Gobblin.

Hi Tilak,

What kind of scale are you looking at? And are your Hive tables managed or
external?

If I recall correctly, updates can only be applied to managed Hive ORC
tables. However, I doubt that lookup-and-update would work well at high
volume. If your volume is low and the Hive table is managed, you can look
into an S3 source, a converter for the lookup, and a JDBC writer.

For high volume, your use case looks similar to our database ingest at
LinkedIn. We ingest snapshots as well as increments, and apply the
increments on the snapshots. We materialize deltas into snapshots only
infrequently; instead we use specialized readers that read from a snapshot
with the deltas applied at read time. The delta materialization into
snapshots is done by a legacy system, which is on its way to being replaced
with Gobblin.

Abhishek
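[Editor's note: for readers following the low-volume path Abhishek mentions (S3 source, lookup converter, JDBC writer), a Gobblin job config would be shaped roughly as below. The property keys `job.name`, `job.group`, `source.class`, `converter.classes`, and `writer.builder.class` are standard Gobblin configuration; the `com.example.*` class names are placeholders for classes you would implement or substitute, not actual Gobblin classes.]

```properties
# Sketch only: com.example.* class names are hypothetical placeholders.
job.name=S3CsvToHiveUpsert
job.group=ingestion

# Source that lists and reads the periodic CSV dumps from the S3 bucket
source.class=com.example.S3CsvSource

# Converter that looks up the existing row by primary key and drops
# records whose data has not changed
converter.classes=com.example.PrimaryKeyLookupConverter

# JDBC writer builder pointed at the managed Hive table
writer.builder.class=com.example.HiveJdbcWriterBuilder
```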

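[Editor's note: the high-volume approach Abhishek describes, reading a snapshot with deltas applied by primary key at read time, can be sketched as below. The record shape (dicts keyed by `id`) is illustrative and not Gobblin's actual record format.]

```python
# Read-time merge sketch: snapshot rows plus later delta batches,
# applied in order by primary key. Illustrative only.

def read_with_deltas(snapshot, deltas, key="id"):
    """Return snapshot rows with delta rows applied on top.

    `snapshot` is a list of dict records; `deltas` is a list of delta
    batches (oldest first). A delta row with the same primary key
    replaces the snapshot row; rows with new keys are added.
    """
    merged = {row[key]: row for row in snapshot}
    for batch in deltas:
        for row in batch:
            merged[row[key]] = row
    return sorted(merged.values(), key=lambda r: r[key])

snapshot = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
deltas = [[{"id": 2, "name": "b2"}], [{"id": 3, "name": "c"}]]
rows = read_with_deltas(snapshot, deltas)
```

Materializing `rows` back into a table is exactly the infrequent "delta materialization into snapshot" step; doing the merge in the reader avoids paying that cost on every ingest.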