With Gobblin, the common pattern is to write the data out as Avro files and use Hive registration to create the partitions / tables. A JDBC writer is possible, but one has not been written for Hive yet. The wiring for that pattern looks roughly like the sketch below.
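This is a sketch only: the source and converter class names are hypothetical placeholders, and the Hive registration keys have moved around between releases, so verify them against the version you run.

# Sketch of the Avro-files-plus-Hive-registration pattern. Class and key
# names should be checked against your Gobblin release before use.
job.name=S3CsvToHiveAvro
# Hypothetical S3/CSV source and the lookup converter discussed in this thread.
source.class=com.example.S3CsvSource
converter.classes=com.example.CsvToGenericRecordConverter
# Write the converted records out as Avro files on HDFS.
writer.builder.class=org.apache.gobblin.writer.AvroDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=AVRO
# Publish the files, then register the published paths as Hive partitions/tables.
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
hive.registration.policy=org.apache.gobblin.hive.policy.HiveRegistrationPolicyBase
hive.database.name=ingest_db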
Abhishek

On Sat, Mar 24, 2018 at 12:04 PM, Tilak Patidar <[email protected]> wrote:

> Hey,
>
> The scale is small, about 50-80GB.
> I am going with the approach of reading CSVs from the S3 bucket and then
> writing a custom converter which converts FileAwareInputStream into
> GenericRecord. In this converter I am doing a batched lookup and
> filtering out rows. However, I was unable to find any HiveJDBCPublisher.
> How is data usually ingested into Hive using Gobblin? JDBC or something
> better?
>
> Regards,
> Tilak Patidar
>
> On Fri, Mar 23, 2018 at 11:52 PM, Abhishek Tiwari <[email protected]> wrote:
>
> > + user@gobblin
> >
> > On Fri, Mar 23, 2018 at 11:19 AM, Abhishek Tiwari <[email protected]> wrote:
> >
> > > (moved conversation from Gitter)
> > >
> > >> Tilak Patidar @tilakpatidar 02:21
> > >> Hi all, I have a use case about which I wanted a general idea of how
> > >> to solve it using Gobblin. We are getting the data from our client in
> > >> the form of CSV dumps in S3 buckets periodically. These dumps could
> > >> be deltas or full dumps; we don't know which they will be. We need to
> > >> write this data into a Hive table. So, while writing we might have to
> > >> check for changes in a row based on the primary key and only update
> > >> Hive if the data has changed for that primary key. How can this be
> > >> solved using Gobblin? I looked into Hive merge but was wondering how
> > >> I could use this with Gobblin.
> > >
> > > Hi Tilak,
> > >
> > > What kind of scale are you looking at, and do you have managed Hive
> > > tables or external ones?
> > > If I recall correctly, updates can only be applied to managed Hive ORC
> > > tables. However, I doubt that lookup-and-update would work well at
> > > high volume. If your volume is low and the Hive table is managed, then
> > > you can look into an S3 source, a converter for the lookup, and a JDBC
> > > writer.
> > > For high volume, however, your use case looks similar to our database
> > > ingest at LinkedIn. We ingest snapshots as well as increments, and
> > > apply the increments on the snapshots. We materialize deltas into
> > > snapshots only infrequently; instead, we use specialized readers that
> > > read data from the snapshot with the deltas applied at read time. The
> > > delta materialization into snapshots is done via a legacy system,
> > > which is on its way to being replaced with Gobblin.
> > >
> > > Abhishek
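The converter approach described above (FileAwareInputStream to GenericRecord, with a batched lookup to drop unchanged rows) could be sketched as follows. The Converter and FileAwareInputStream signatures are from the Apache Gobblin codebase (older releases use the gobblin.* package prefix instead of org.apache.gobblin.*); the all-string schema handling, the naive CSV split, and the isUnchanged() helper are hypothetical placeholders to be replaced with a real CSV parser and a batched Hive/JDBC lookup.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.converter.Converter;
import org.apache.gobblin.converter.DataConversionException;
import org.apache.gobblin.converter.SchemaConversionException;
import org.apache.gobblin.data.management.copy.FileAwareInputStream;

public class CsvToGenericRecordConverter
    extends Converter<String, Schema, FileAwareInputStream, GenericRecord> {

  @Override
  public Schema convertSchema(String inputSchema, WorkUnitState workUnit)
      throws SchemaConversionException {
    // Assumes the Avro schema is passed in as a JSON string via job config.
    return new Schema.Parser().parse(inputSchema);
  }

  @Override
  public Iterable<GenericRecord> convertRecord(Schema schema,
      FileAwareInputStream input, WorkUnitState workUnit)
      throws DataConversionException {
    List<GenericRecord> out = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(input.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // Naive split; use a real CSV parser for quoted or escaped fields.
        String[] cols = line.split(",", -1);
        // Assumes an all-string Avro schema whose field order matches the CSV.
        GenericRecord record = new GenericData.Record(schema);
        for (int i = 0; i < schema.getFields().size(); i++) {
          record.put(schema.getFields().get(i).name(), cols[i]);
        }
        // Drop rows whose primary key already exists with identical data.
        if (!isUnchanged(record)) {
          out.add(record);
        }
      }
    } catch (Exception e) {
      throw new DataConversionException(e);
    }
    return out;
  }

  // Hypothetical placeholder for the batched primary-key lookup; in practice
  // this would buffer keys and query Hive over JDBC in batches rather than
  // once per record.
  private boolean isUnchanged(GenericRecord record) {
    return false;
  }
}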
