Hey,

The scale is small, about 50-80 GB.
I am going with the approach of reading CSVs from the S3 bucket and then
writing a custom converter which converts FileAwareInputStream into
GenericRecord. In this converter I am doing a batched lookup and filtering
out unchanged rows. However, I was unable to find any HiveJDBCPublisher.
How is data usually ingested into Hive using Gobblin? JDBC, or something
better?
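To make the batched lookup concrete, here is a rough standalone sketch of the filtering step I have in mind (class and method names are my own; a real Gobblin converter would extend the Converter API, and the lookup here would be a single batched JDBC query against Hive rather than an in-memory map):

```java
import java.util.*;

// Sketch of batched primary-key lookup and change filtering.
// All names are illustrative, not Gobblin or Hive APIs.
public class BatchedLookupFilter {
    // Stand-in for the Hive-side state, keyed by primary key.
    private final Map<String, String> existing;
    private final int batchSize;

    public BatchedLookupFilter(Map<String, String> existing, int batchSize) {
        this.existing = existing;
        this.batchSize = batchSize;
    }

    /** Returns only rows whose value is new or changed for their primary key. */
    public List<String[]> filterChanged(List<String[]> rows) {
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += batchSize) {
            List<String[]> batch = rows.subList(i, Math.min(i + batchSize, rows.size()));
            // One lookup per batch instead of one per row.
            Map<String, String> current = lookup(batch);
            for (String[] row : batch) {
                String pk = row[0], value = row[1];
                if (!value.equals(current.get(pk))) {
                    out.add(row); // new or changed -> keep
                }
            }
        }
        return out;
    }

    // In the real pipeline this would be one batched JDBC query per batch.
    private Map<String, String> lookup(List<String[]> batch) {
        Map<String, String> result = new HashMap<>();
        for (String[] row : batch) {
            if (existing.containsKey(row[0])) {
                result.put(row[0], existing.get(row[0]));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> existing = new HashMap<>();
        existing.put("1", "alice");
        existing.put("2", "bob");
        BatchedLookupFilter f = new BatchedLookupFilter(existing, 2);
        List<String[]> incoming = Arrays.asList(
            new String[]{"1", "alice"},   // unchanged -> dropped
            new String[]{"2", "robert"},  // changed   -> kept
            new String[]{"3", "carol"});  // new       -> kept
        for (String[] row : f.filterChanged(incoming)) {
            System.out.println(row[0] + "," + row[1]);
        }
        // prints:
        // 2,robert
        // 3,carol
    }
}
```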


Regards,
Tilak Patidar
Email       [email protected]
            [email protected]
Telephone   +91 8608690984
ThoughtWorks <http://www.thoughtworks.com/>
"We are a community of passionate individuals whose purpose is to
revolutionize software design, creation and delivery, while advocating for
positive social change"

On Fri, Mar 23, 2018 at 11:52 PM, Abhishek Tiwari <[email protected]> wrote:

> + user@gobblin
>
> On Fri, Mar 23, 2018 at 11:19 AM, Abhishek Tiwari <[email protected]> wrote:
>
> > (moved conversation from Gitter)
> >
> >> Tilak Patidar @tilakpatidar 02:21
> >> Hi all, I have a use case I wanted a general idea of how to solve
> >> using Gobblin.
> >> We are getting data from our client in the form of CSV dumps in S3
> >> buckets periodically. These dumps could be deltas or full dumps; we
> >> don't know which it will be. We need to write this data into a Hive
> >> table. So, while writing we might have to check for changes in a row
> >> based on primary key and only update Hive if the data has changed for
> >> that primary key. How can this be solved using Gobblin? I looked into
> >> Hive merge but was wondering how I could use it with Gobblin.
> >
> >
> > Hi Tilak,
> >
> > What kind of scale are you looking at? And do you have managed Hive
> > tables or external?
> > If I recall correctly, updates can only be applied to managed Hive ORC
> > tables. However, I doubt that lookup-and-update would work well at high
> > volume. If your volume is low and the Hive table is managed, then you
> > can look into an S3 source, a converter for the lookup, and a JDBC
> > writer.
> > For high volume, however, your use case looks similar to our database
> > ingest at LinkedIn. We ingest snapshots as well as increments, and
> > apply the increments on the snapshots. We materialize deltas into
> > snapshots less frequently, and instead use specialized readers that
> > read data from the snapshot with the deltas applied at read time. The
> > delta materialization into snapshots is done via a legacy system, which
> > is on its way to being replaced with Gobblin.
> >
> > Abhishek
> >
> >
> >
>
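To make sure I understand the snapshot-plus-delta approach described above, here is my rough mental model as a standalone sketch (all names are illustrative, not the APIs of the LinkedIn system or Gobblin): deltas override snapshot rows by primary key at read time, and the merged view is what a later, less frequent materialization would persist as the next snapshot.

```java
import java.util.*;

// Illustrative sketch: apply increments on top of a snapshot at read time.
public class SnapshotDeltaReader {
    /** Merge the snapshot with deltas; later deltas win per primary key. */
    public static Map<String, String> readMerged(Map<String, String> snapshot,
                                                 List<Map<String, String>> deltas) {
        Map<String, String> view = new LinkedHashMap<>(snapshot);
        for (Map<String, String> delta : deltas) {
            view.putAll(delta); // updated keys are overwritten, new keys appended
        }
        return view;
    }

    public static void main(String[] args) {
        Map<String, String> snapshot = new LinkedHashMap<>();
        snapshot.put("1", "alice");
        snapshot.put("2", "bob");
        // One increment: row 2 updated, row 3 inserted.
        Map<String, String> delta = Map.of("2", "robert", "3", "carol");
        System.out.println(readMerged(snapshot, List.of(delta)));
        // prints: {1=alice, 2=robert, 3=carol}
    }
}
```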
