Hey,

The scale is small, about 50-80 GB. I am going with the approach of reading CSVs from the S3 bucket and then writing a custom converter that converts a FileAwareInputStream into GenericRecords. In this converter I do a batched lookup and filter out rows. However, I was unable to find any HiveJDBCPublisher. How is data usually ingested into Hive using Gobblin? JDBC, or something better?
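For what it's worth, the "batched lookup and filter" step I describe could look roughly like the sketch below. This is plain Java illustrating only the filtering logic, not the actual Gobblin Converter API; `ChangedRowFilter` and the map of existing rows are hypothetical stand-ins for whatever batched Hive/JDBC lookup ends up backing it:

```java
import java.util.*;

// Sketch: keep only rows that are new or whose value changed for a given
// primary key. The "existing" map stands in for a batched lookup against
// the target Hive table (e.g. via JDBC); names here are illustrative.
public class ChangedRowFilter {
    private final Map<String, String> existing; // primaryKey -> current row value

    public ChangedRowFilter(Map<String, String> existing) {
        this.existing = existing;
    }

    /** Return only the rows that are new or whose value differs from the lookup. */
    public List<Map.Entry<String, String>> filterChanged(Map<String, String> batch) {
        List<Map.Entry<String, String>> changed = new ArrayList<>();
        for (Map.Entry<String, String> row : batch.entrySet()) {
            String prev = existing.get(row.getKey());
            if (prev == null || !prev.equals(row.getValue())) {
                changed.add(row); // new key, or value changed -> emit downstream
            }
        }
        return changed;
    }
}
```

In a real converter the lookup would be done once per batch of records (one `SELECT ... WHERE pk IN (...)`) rather than per row, which is the point of batching.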
Regards,
Tilak Patidar
Email: [email protected] | [email protected]
Telephone: +91 8608690984
ThoughtWorks <http://www.thoughtworks.com/>
"We are a community of passionate individuals whose purpose is to revolutionize software design, creation and delivery, while advocating for positive social change"

On Fri, Mar 23, 2018 at 11:52 PM, Abhishek Tiwari <[email protected]> wrote:

> + user@gobblin
>
> On Fri, Mar 23, 2018 at 11:19 AM, Abhishek Tiwari <[email protected]> wrote:
>
>> (moved conversation from Gitter)
>>
>>> Tilak Patidar @tilakpatidar 02:21
>>> Hi all, I have a use case about which I wanted a general idea of how to solve it using Gobblin.
>>> We are getting data from our client in the form of CSV dumps in S3 buckets periodically. These dumps could be deltas or full dumps; we don't know which. We need to write this data into a Hive table. So, while writing we might have to check for changes in a row based on primary key and only update Hive if the data has changed for that primary key. How can this be solved using Gobblin? I looked into Hive merge but was wondering how I could use it with Gobblin.
>>
>> Hi Tilak,
>>
>> What kind of scale are you looking at? And do you have managed Hive tables or external?
>> If I recall correctly, updates can only be applied to managed Hive ORC tables. However, I doubt that lookup-and-update would work well at high volume. If your volume is low and the Hive table is managed, then you can look into an S3 source, a converter for the lookup, and a JDBC writer.
>> However, for high volume, your use case looks similar to our database ingest at LinkedIn. We ingest snapshots as well as increments, and apply the increments on the snapshots. We materialize deltas into snapshots less frequently, but instead use specialized readers that read data from the snapshot with the deltas applied at read time. The delta materialization into snapshots is done via a legacy system, which is on its way to being replaced with Gobblin.
>>
>> Abhishek
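The snapshot-plus-delta read described in the quoted reply can be illustrated with a tiny sketch: overlay delta records (keyed by primary key, later deltas winning) onto a base snapshot at read time. This is a minimal illustration of the idea only, not LinkedIn's or Gobblin's actual implementation; class and method names are made up:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: build a read-time view by overlaying deltas
// (applied in order, later wins) on top of a base snapshot,
// with both sides keyed by primary key.
public class SnapshotDeltaView {
    public static Map<String, String> apply(Map<String, String> snapshot,
                                            List<Map<String, String>> deltas) {
        Map<String, String> view = new LinkedHashMap<>(snapshot);
        for (Map<String, String> delta : deltas) {
            view.putAll(delta); // upsert: delta rows replace snapshot rows
        }
        return view;
    }
}
```

The appeal of this approach is that expensive full materialization can be deferred: readers pay a small merge cost per query, and the snapshot is only rewritten occasionally.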
