(moved conversation from Gitter)

> Tilak Patidar @tilakpatidar 02:21
> Hi all, I have a use case I wanted a general idea on how to solve using
> Gobblin. We are getting data from our client in the form of CSV dumps in
> S3 buckets, periodically. These dumps could be deltas or full dumps; we
> don't know which it will be. We need to write this data into a Hive table,
> so while writing we might have to check for changes in a row based on its
> primary key, and only update Hive if the data has changed for that primary
> key. How can this be solved using Gobblin? I looked into Hive merge but was
> wondering how I could use it with Gobblin.
Hi Tilak,

What kind of scale are you looking at? And do you have managed Hive tables or
external ones? If I recall correctly, updates can only be applied to managed
Hive ORC tables. I also doubt that lookup-and-update would work well at high
volume. If your volume is low and the Hive table is managed, then you can look
into an S3 source, a converter that does the lookup, and a JDBC writer.

For high volume, however, your use case looks similar to our database ingest at
LinkedIn. We ingest snapshots as well as increments, and apply the increments
on the snapshots. We materialize deltas into snapshots only infrequently;
instead, we use specialized readers that read from a snapshot with the deltas
applied at read time. The delta materialization into snapshots is done by a
legacy system, which is on its way to being replaced with Gobblin.

Abhishek
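To make the "deltas applied on snapshots at read time" idea concrete, here is a
minimal illustrative sketch in Python. It is not Gobblin or Hive API code; the
record layout, the `id` primary key, and the `apply_deltas` helper are all
assumptions for illustration. It assumes deltas are ordered oldest-to-newest, so
a later delta for the same key wins.

```python
def apply_deltas(snapshot, deltas, key="id"):
    """Return the merged view of `snapshot` with `deltas` applied.

    `snapshot` and `deltas` are iterables of dict records. A delta record
    whose key already exists in the snapshot replaces that row (an update);
    a delta with a new key is an insert. Result order follows the snapshot,
    with inserted rows appended in delta order.
    """
    # Index the snapshot by primary key (dicts preserve insertion order).
    merged = {row[key]: row for row in snapshot}
    for row in deltas:
        merged[row[key]] = row  # upsert: update if present, insert otherwise
    return list(merged.values())
```

A full materialization would persist the merged result as the new snapshot; a
read-time reader would instead perform this merge on the fly while scanning,
which is why the snapshot only needs to be rewritten infrequently.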
