+ user@gobblin

On Fri, Mar 23, 2018 at 11:19 AM, Abhishek Tiwari <[email protected]> wrote:
> (moved conversation from Gitter)
>
>> Tilak Patidar @tilakpatidar 02:21
>> Hi all, I have a use case I wanted a general idea on how to solve
>> using Gobblin. We receive data from our client as CSV dumps in S3
>> buckets periodically. These dumps could be deltas or full dumps; we don't
>> know which it will be. We need to write this data into a Hive table, so
>> while writing we may have to check for changes in a row based on a primary
>> key and only update Hive if the data has changed for that key. How can
>> this be solved using Gobblin? I looked into Hive merge but was wondering
>> how I could use it with Gobblin.

Hi Tilak,

What kind of scale are you looking at? And are your Hive tables managed or
external?

If I recall correctly, updates can only be applied to managed Hive ORC
tables. However, I doubt that lookup-and-update would work well at high
volume. If your volume is low and the Hive table is managed, you can look
into an S3 source, a converter for the lookup, and a JDBC writer.

For high volume, your use case looks similar to our database ingest at
LinkedIn. We ingest snapshots as well as increments, and apply the
increments on the snapshots. We materialize deltas into snapshots only
infrequently; instead we use specialized readers that read from a snapshot
with the deltas applied at read time. The delta materialization into
snapshots is done by a legacy system, which is on its way to being replaced
with Gobblin.

Abhishek
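[Editor's note: for readers following the low-volume path Abhishek mentions (S3 source, lookup converter, JDBC writer), a Gobblin job config would be shaped roughly as below. The property keys `job.name`, `job.group`, `source.class`, `converter.classes`, and `writer.builder.class` are standard Gobblin configuration; the `com.example.*` class names are placeholders for classes you would implement or substitute, not actual Gobblin classes.]

```properties
# Sketch only: com.example.* class names are hypothetical placeholders.
job.name=S3CsvToHiveUpsert
job.group=ingestion

# Source that lists and reads the periodic CSV dumps from the S3 bucket
source.class=com.example.S3CsvSource

# Converter that looks up the existing row by primary key and drops
# records whose data has not changed
converter.classes=com.example.PrimaryKeyLookupConverter

# JDBC writer builder pointed at the managed Hive table
writer.builder.class=com.example.HiveJdbcWriterBuilder
```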

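[Editor's note: the high-volume approach Abhishek describes, reading a snapshot with deltas applied by primary key at read time, can be sketched as below. The record shape (dicts keyed by `id`) is illustrative and not Gobblin's actual record format.]

```python
# Read-time merge sketch: snapshot rows plus later delta batches,
# applied in order by primary key. Illustrative only.

def read_with_deltas(snapshot, deltas, key="id"):
    """Return snapshot rows with delta rows applied on top.

    `snapshot` is a list of dict records; `deltas` is a list of delta
    batches (oldest first). A delta row with the same primary key
    replaces the snapshot row; rows with new keys are added.
    """
    merged = {row[key]: row for row in snapshot}
    for batch in deltas:
        for row in batch:
            merged[row[key]] = row
    return sorted(merged.values(), key=lambda r: r[key])

snapshot = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
deltas = [[{"id": 2, "name": "b2"}], [{"id": 3, "name": "c"}]]
rows = read_with_deltas(snapshot, deltas)
```

Materializing `rows` back into a table is exactly the infrequent "delta materialization into snapshot" step; doing the merge in the reader avoids paying that cost on every ingest.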