RE: [PROPOSAL] Apache Hive connector

Seshadri Raghunathan Sun, 14 May 2017 15:53:49 -0700

Many thanks for your input. It does indeed work simply by *configuring*
HadoopInputFormatIO!


Perhaps I will simply write an integration test case with this configuration 
which could serve as a reference for reading from Hive using HCatalog.  

I see existing integration tests for HadoopInputFormatIO (HIFIO) here:
https://github.com/apache/beam/blob/master/sdks/java/io/hadoop/jdk1.8-tests/src/test/java/org/apache/beam/sdk/io/hadoop/inputformat/integration/tests/
I will go ahead and write one for HCatalog.

Please let me know if you have any comments.

Thanks,
Seshadri

From: Eugene Kirpichov <[email protected]>
Date: Fri, May 12, 2017 at 2:43 PM
Subject: Re: [PROPOSAL] Apache Hive connector
To: [email protected]


Hi!

Why do you need to override methods like computeSplitsIfNecessary at all?
Is HCatalogIO so different from other HadoopInputFormats that it cannot be
handled by the generic code of HadoopInputFormatIO? I looked at the
implementation in your commit and it seems identical except for one line,
"HCatInputFormat.setInput(conf.getHadoopConfiguration(), database, table,
filter)", and that line appears to simply populate the Configuration for
HadoopInputFormatIO, which can be done via
HadoopInputFormatIO.withConfiguration().

I.e. so far it seems like HCatalogIO can be implemented by *configuring*
HadoopInputFormatIO, rather than extending it. Am I missing something?
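
A minimal sketch of what "configuring rather than extending" might look like
is below. It is not taken from the commit under review. The class and method
names (HCatalogReadSketch, readHiveTable), the metastoreUri parameter and the
choice of LongWritable / HCatRecord as key and value classes are assumptions
made for illustration, and the sketch needs the Beam hadoop input-format
module and Hive HCatalog on the classpath. A coder for HCatRecord may also
have to be supplied depending on the Beam version.

```java
import java.io.IOException;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIO;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

/** Hypothetical helper: reads a Hive table through the generic HadoopInputFormatIO. */
public class HCatalogReadSketch {

  public static PCollection<KV<LongWritable, HCatRecord>> readHiveTable(
      Pipeline pipeline, String metastoreUri, String database, String table, String filter)
      throws IOException {
    Configuration conf = new Configuration();

    // The three properties HadoopInputFormatIO expects in its Configuration.
    conf.setClass("mapreduce.job.inputformat.class", HCatInputFormat.class, InputFormat.class);
    // Assumption: key and value types produced by HCatInputFormat for this table.
    conf.setClass("key.class", LongWritable.class, Object.class);
    conf.setClass("value.class", HCatRecord.class, Object.class);

    // Assumption: the metastore is reached through a thrift URI.
    conf.set("hive.metastore.uris", metastoreUri);

    // The one HCatalog-specific step from the discussion: it only writes
    // the database / table / filter information into the Configuration.
    HCatInputFormat.setInput(conf, database, table, filter);

    // No subclassing: the generic source and reader do the rest.
    return pipeline.apply(
        "ReadFromHCatalog",
        HadoopInputFormatIO.<LongWritable, HCatRecord>read().withConfiguration(conf));
  }
}
```

With this shape, HadoopInputFormatBoundedSource and HadoopInputFormatReader
would not need to be made public or extensible.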

On Fri, May 12, 2017 at 12:11 PM Seshadri Raghunathan <[email protected]> wrote:

> Hi Eugene,
>
> In order to reuse HadoopInputFormatIO, this is what I am thinking -
>
> 1. Extend HadoopInputFormatBoundedSource to create HCatalogBoundedSource.
> 2. Override the necessary methods in HCatalogBoundedSource to perform
> HCatalog-specific steps (overriding the computeSplitsIfNecessary() method
> should be enough, as far as I can see now).
> 3. Use HCatalogBoundedSource and HadoopInputFormatReader in an HCatalog
> wrapper class to perform IO.
>
> Initially I started this way, but since it involves modifying
> HadoopInputFormatReader / HadoopInputFormatBoundedSource to make them
> public / extensible, I wasn't sure whether this fits the Beam authoring
> guidelines, and hence came up with the solution I shared in my earlier note.
>
> Please let me know your thoughts !
>
> HadoopInputFormatIO -
>
> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop/input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java#L172
>
> HadoopInputFormatBoundedSource -
>
> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop/input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java#L367
>
> HadoopInputFormatReader -
>
> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop/input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java#L584
>
> On Thu, May 11, 2017 at 4:57 PM, Seshadri Raghunathan <[email protected]> wrote:
>
> > Thanks Eugene, that makes sense. This solution borrows heavily from
> > HadoopInputFormatIO, with a tweak for HCatalog (and related parameters).
> > I will try to reuse HadoopInputFormatIO rather than the current approach.
> >
> > On Thu, May 11, 2017 at 4:44 PM, Eugene Kirpichov <[email protected]> wrote:
> >
> >> Thanks Seshadri! This seems to have a great deal of copy-paste from
> >> HadoopInputFormatIO. Is it possible to instead implement this connector
> >> as a wrapper around it, rather than copy-paste?
> >>
> >> On Thu, May 11, 2017 at 4:41 PM Seshadri Raghunathan <[email protected]> wrote:
> >>
> >> > Hi all,
> >> >
> >> > Here is a draft implementation of this proposal -
> >> >
> >> > https://github.com/seshadri-cr/beam/commit/78cdf8772f2cd5bb9cd018b1c99c3ad0854157c1
> >> >
> >> > Many thanks to Ismaël Mejía, who helped with a high-level review and
> >> > follow-up of this design / approach.
> >> >
> >> > Looking forward to further review and comments from the wider community
> >> > to move forward on this proposal.
> >> >
> >> > Thanks,
> >> > Seshadri
> >> >
> >> >
> >> > On Wed, May 10, 2017 at 3:05 PM, Madhusudan Borkar <[email protected]> wrote:
> >> >
> >> > > Hi all,
> >> > > Thank you for your responses to the earlier proposal. Taking into
> >> > > account all the suggestions, we are making a new proposal for the Hive
> >> > > connector. Please let us know your feedback.
> >> > >
> >> > > [1]
> >> > > https://docs.google.com/document/d/1aeQRLXjVr38Z03_zWkHO9YQhtnj0jHoCfhsSNm-wxtA/edit?usp=sharing
> >> > >
> >> > > [2] https://issues.apache.org/jira/browse/BEAM-1158
> >> > >
> >> > > Madhu Borkar
> >> > >
> >> >
> >>
> >
> >
>
