With Gobblin, the common pattern is to write the data out as Avro files and use Hive registration to create the partitions / tables. A JDBC writer is possible, but one has not been written for Hive yet. The wiring for that pattern looks roughly like the sketch below.
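This is a sketch only: the source and converter class names are hypothetical placeholders, and the Hive registration keys have moved around between releases, so verify them against the version you run.

# Sketch of the Avro-files-plus-Hive-registration pattern. Class and key
# names should be checked against your Gobblin release before use.
job.name=S3CsvToHiveAvro
# Hypothetical S3/CSV source and the lookup converter discussed in this thread.
source.class=com.example.S3CsvSource
converter.classes=com.example.CsvToGenericRecordConverter
# Write the converted records out as Avro files on HDFS.
writer.builder.class=org.apache.gobblin.writer.AvroDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=AVRO
# Publish the files, then register the published paths as Hive partitions/tables.
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
hive.registration.policy=org.apache.gobblin.hive.policy.HiveRegistrationPolicyBase
hive.database.name=ingest_db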
Abhishek

On Sat, Mar 24, 2018 at 12:04 PM, Tilak Patidar <[email protected]> wrote:

> Hey,
>
> The scale is small, about 50-80GB.
> I am going with the approach of reading CSVs from the S3 bucket and then
> writing a custom converter which converts FileAwareInputStream into
> GenericRecord. In this converter I am doing a batched lookup and
> filtering out rows. However, I was unable to find any HiveJDBCPublisher.
> How is data usually ingested into Hive using Gobblin? JDBC or something
> better?
>
> Regards,
> Tilak Patidar
>
> On Fri, Mar 23, 2018 at 11:52 PM, Abhishek Tiwari <[email protected]> wrote:
>
> > + user@gobblin
> >
> > On Fri, Mar 23, 2018 at 11:19 AM, Abhishek Tiwari <[email protected]> wrote:
> >
> > > (moved conversation from Gitter)
> > >
> > >> Tilak Patidar @tilakpatidar 02:21
> > >> Hi all, I have a use case about which I wanted a general idea of how
> > >> to solve it using Gobblin. We are getting the data from our client in
> > >> the form of CSV dumps in S3 buckets periodically. These dumps could
> > >> be deltas or full dumps; we don't know which they will be. We need to
> > >> write this data into a Hive table. So, while writing we might have to
> > >> check for changes in a row based on the primary key and only update
> > >> Hive if the data has changed for that primary key. How can this be
> > >> solved using Gobblin? I looked into Hive merge but was wondering how
> > >> I could use this with Gobblin.
> > >
> > > Hi Tilak,
> > >
> > > What kind of scale are you looking at, and do you have managed Hive
> > > tables or external ones?
> > > If I recall correctly, updates can only be applied to managed Hive ORC
> > > tables. However, I doubt that lookup-and-update would work well at
> > > high volume. If your volume is low and the Hive table is managed, then
> > > you can look into an S3 source, a converter for the lookup, and a JDBC
> > > writer.
> > > For high volume, however, your use case looks similar to our database
> > > ingest at LinkedIn. We ingest snapshots as well as increments, and
> > > apply the increments on the snapshots. We materialize deltas into
> > > snapshots only infrequently; instead, we use specialized readers that
> > > read data from the snapshot with the deltas applied at read time. The
> > > delta materialization into snapshots is done via a legacy system,
> > > which is on its way to being replaced with Gobblin.
> > >
> > > Abhishek
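The converter approach described above (FileAwareInputStream to GenericRecord, with a batched lookup to drop unchanged rows) could be sketched as follows. The Converter and FileAwareInputStream signatures are from the Apache Gobblin codebase (older releases use the gobblin.* package prefix instead of org.apache.gobblin.*); the all-string schema handling, the naive CSV split, and the isUnchanged() helper are hypothetical placeholders to be replaced with a real CSV parser and a batched Hive/JDBC lookup.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.converter.Converter;
import org.apache.gobblin.converter.DataConversionException;
import org.apache.gobblin.converter.SchemaConversionException;
import org.apache.gobblin.data.management.copy.FileAwareInputStream;

public class CsvToGenericRecordConverter
    extends Converter<String, Schema, FileAwareInputStream, GenericRecord> {

  @Override
  public Schema convertSchema(String inputSchema, WorkUnitState workUnit)
      throws SchemaConversionException {
    // Assumes the Avro schema is passed in as a JSON string via job config.
    return new Schema.Parser().parse(inputSchema);
  }

  @Override
  public Iterable<GenericRecord> convertRecord(Schema schema,
      FileAwareInputStream input, WorkUnitState workUnit)
      throws DataConversionException {
    List<GenericRecord> out = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(input.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // Naive split; use a real CSV parser for quoted or escaped fields.
        String[] cols = line.split(",", -1);
        // Assumes an all-string Avro schema whose field order matches the CSV.
        GenericRecord record = new GenericData.Record(schema);
        for (int i = 0; i < schema.getFields().size(); i++) {
          record.put(schema.getFields().get(i).name(), cols[i]);
        }
        // Drop rows whose primary key already exists with identical data.
        if (!isUnchanged(record)) {
          out.add(record);
        }
      }
    } catch (Exception e) {
      throw new DataConversionException(e);
    }
    return out;
  }

  // Hypothetical placeholder for the batched primary-key lookup; in practice
  // this would buffer keys and query Hive over JDBC in batches rather than
  // once per record.
  private boolean isUnchanged(GenericRecord record) {
    return false;
  }
}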
