The problem seems to be an incompatibility between the Hive/HCatalog version
that HCatalogIO expects and what you have on the classpath. Beam's HCatalogIO
is compiled against Hive 2.1, but you are packaging 1.2.0.
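
A quick way to confirm which HCatalog actually ends up on the runtime classpath is a small reflection check. This is a diagnostic sketch only: the class and method names are the ones from the stack trace below, everything else is illustrative.

    import java.lang.reflect.Method;

    // Diagnostic sketch: print which jar provides HCatUtil and whether it
    // exposes the method Beam's HCatalogIO calls. Run it with the exact
    // classpath you submit to Spark.
    public class HCatClasspathCheck {
      public static void main(String[] args) throws Exception {
        Class<?> hcatUtil = Class.forName("org.apache.hive.hcatalog.common.HCatUtil");
        // The jar (and therefore the Hive version) HCatUtil was loaded from.
        System.out.println("HCatUtil loaded from: "
            + hcatUtil.getProtectionDomain().getCodeSource().getLocation());
        Class<?> hiveConf = Class.forName("org.apache.hadoop.hive.conf.HiveConf");
        try {
          // The exact signature from the NoSuchMethodError below.
          Method m = hcatUtil.getMethod("getHiveMetastoreClient", hiveConf);
          System.out.println("Found: " + m);
        } catch (NoSuchMethodException e) {
          System.out.println("getHiveMetastoreClient(HiveConf) is missing - the "
              + "Hive/HCatalog on the classpath is older than what Beam expects.");
        }
      }
    }

If the jar it prints is a Hive 1.2.x artifact, aligning the hive-exec and hive-hcatalog-core versions with what beam-sdks-java-io-hcatalog was built against should resolve the NoSuchMethodError.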


On Mon, Feb 17, 2020 at 7:47 AM Gershi, Noam <noam.ger...@citi.com> wrote:

> 2.19.0 did not work either...
>
>
> -----Original Message-----
> From: Tomo Suzuki <suzt...@google.com>
> Sent: Friday, February 14, 2020 11:31 PM
> To: user@beam.apache.org
> Subject: Re: Apache Beam with Hive
>
> I see the Beam dependency 2.15 is a bit outdated. Would you try 2.19.0?
>
> On Thu, Feb 13, 2020 at 3:52 AM Gershi, Noam <noam.ger...@citi.com> wrote:
> >
> > Hi
> > Pom attached.
> >
> > -----Original Message-----
> > From: Tomo Suzuki <suzt...@google.com>
> > Sent: Wednesday, February 12, 2020 4:56 PM
> > To: user@beam.apache.org
> > Subject: Re: Apache Beam with Hive
> >
> > Hi Noam,
> >
> > It seems there are incompatible libraries in your dependencies.
> > Would you share the pom.xml (or build.gradle) that produced the NoSuchMethodError?
> >
> > On Wed, Feb 12, 2020 at 12:47 AM Gershi, Noam <noam.ger...@citi.com> wrote:
> > >
> > > I used HCatalogIO:
> > >
> > >
> > >
> > > Map<String, String> configProperties = new HashMap<>();
> > > configProperties.put("hive.metastore.uris", "....");
> > >
> > > pipeline.apply(HCatalogIO.read()
> > >                 .withConfigProperties(configProperties)
> > >                 .withDatabase(DB_NAME)
> > >                 .withTable(TEST_TABLE))
> > >         .apply("to-string", MapElements.via(new SimpleFunction<HCatRecord, String>() {
> > >             @Override
> > >             public String apply(HCatRecord input) {
> > >                 return input.toString();
> > >             }
> > >         }))
> > >         .apply(TextIO.write().to("my-logs/output.txt").withoutSharding());
> > >
> > >
> > >
> > > But I am getting this error:
> > >
> > >
> > >
> > > 20/02/11 07:49:24 INFO spark.SparkContext: Starting job: collect at BoundedDataset.java:93
> > > Exception in thread "dag-scheduler-event-loop" java.lang.NoSuchMethodError: org.apache.hive.hcatalog.common.HCatUtil.getHiveMetastoreClient(Lorg/apache/hadoop/hive/conf/HiveConf;)Lorg/apache/hadoop/hive/metastore/IMetaStoreClient;
> > >         at org.apache.beam.sdk.io.hcatalog.HCatalogUtils.createMetaStoreClient(HCatalogUtils.java:42)
> > >         at org.apache.beam.sdk.io.hcatalog.HCatalogIO$BoundedHCatalogSource.getEstimatedSizeBytes(HCatalogIO.java:323)
> > >         at org.apache.beam.runners.spark.io.SourceRDD$Bounded.getPartitions(SourceRDD.java:104)
> > >         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
> > >         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
> > >         at scala.Option.getOrElse(Option.scala:121)
> > >         at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
> > >         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> > >         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
> > >         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
> > >         at scala.Option.getOrElse(Option.scala:121)
> > >         at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
> > >         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> > >         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
> > >         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
> > >         at scala.Option.getOrElse(Option.scala:121)
> > >         at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
> > >         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> > >         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
> > >         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
> > >         at scala.Option.getOrElse(Option.scala:121)
> > >         at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
> > >         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> > >         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
> > >         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
> > >         at scala.Option.getOrElse(Option.scala:121)
> > >         at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
> > >         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> > >         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
> > >         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
> > >         at scala.Option.getOrElse(Option.scala:121)
> > >         at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
> > >         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> > >         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
> > >         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
> > >         at scala.Option.getOrElse(Option.scala:121)
> > >         at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
> > >         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> > >         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
> > >         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
> > >         at scala.Option.getOrElse(Option.scala:121)
> > >         at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
> > >         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> > >         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
> > >         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
> > >         at scala.Option.getOrElse(Option.scala:121)
> > >         at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
> > >         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> > >         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
> > >         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
> > >         at scala.Option.getOrElse(Option.scala:121)
> > >         at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
> > >         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> > >         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
> > >         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
> > >         at scala.Option.getOrElse(Option.scala:121)
> > >         at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
> > >         at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:94)
> > >         at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:87)
> > >         at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:240)
> > >         at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:238)
> > >         at scala.Option.getOrElse(Option.scala:121)
> > >         at org.apache.spark.rdd.RDD.dependencies(RDD.scala:238)
> > >         at org.apache.spark.scheduler.DAGScheduler.getShuffleDependencies(DAGScheduler.scala:512)
> > >         at org.apache.spark.scheduler.DAGScheduler.getOrCreateParentStages(DAGScheduler.scala:461)
> > >         at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:448)
> > >         at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:962)
> > >         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2067)
> > >         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
> > >         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
> > >         at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> > > 20/02/11 07:52:54 INFO util.JsonImpl: Shutting down HTTP client
> > >
> > >
> > >
> > >
> > >
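
For reference, a self-contained version of the read pipeline quoted above looks roughly like the sketch below. It is only a sketch: it assumes the Spark runner plus Hive 2.1-compatible HCatalog jars on the classpath, and the metastore URI, database, and table names are placeholders standing in for the original "....", DB_NAME, and TEST_TABLE.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.hcatalog.HCatalogIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.SimpleFunction;
    import org.apache.hive.hcatalog.data.HCatRecord;

    public class HCatalogReadSketch {
      public static void main(String[] args) {
        // The runner is passed on the command line, e.g. --runner=SparkRunner.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline pipeline = Pipeline.create(options);

        Map<String, String> configProperties = new HashMap<>();
        configProperties.put("hive.metastore.uris", "thrift://<metastore-host>:9083"); // placeholder

        pipeline
            .apply(HCatalogIO.read()
                .withConfigProperties(configProperties)
                .withDatabase("my_db")    // stands in for DB_NAME
                .withTable("my_table"))   // stands in for TEST_TABLE
            .apply("to-string", MapElements.via(new SimpleFunction<HCatRecord, String>() {
              @Override
              public String apply(HCatRecord input) {
                return input.toString();
              }
            }))
            .apply(TextIO.write().to("my-logs/output.txt").withoutSharding());

        pipeline.run().waitUntilFinish();
      }
    }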
> > > From: Gershi, Noam [ICG-IT]
> > > Sent: Tuesday, February 11, 2020 11:08 AM
> > > To: user@beam.apache.org
> > > Subject: RE: Apache Beam with Hive
> > >
> > >
> > >
> > > Thanks.
> > >
> > > Looks like HCatalog could work.
> > >
> > > But is there an example with a ‘SELECT’ query?
> > >
> > >
> > >
> > > JdbcIO is probably not a good fit for me, since my Spark cluster is already configured to work with Hive. So when I code Spark pipelines, I can write queries without needing to provide a user/password. I would like to have something similar in Apache Beam.
> > >
> > >
> > >
> > > From: utkarsh.kh...@gmail.com <utkarsh.kh...@gmail.com>
> > > Sent: Tuesday, February 11, 2020 10:26 AM
> > > To: user@beam.apache.org
> > > Subject: Re: Apache Beam with Hive
> > >
> > >
> > >
> > >
> > >
> > > While I've not created Hive tables using this, reading and writing worked very well using HCatalogIO.
> > >
> > > https://beam.apache.org/releases/javadoc/2.18.0/org/apache/beam/sdk/io/hcatalog/HCatalogIO.html
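
For the write side, a minimal sketch built from that javadoc might look like the following. The metastore URI, database, and table names are illustrative, and the HCatRecord elements are assumed to already match the target table's schema.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.beam.sdk.io.hcatalog.HCatalogIO;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.hive.hcatalog.data.HCatRecord;

    public class HCatalogWriteSketch {
      // Sketch only: write an existing PCollection<HCatRecord> to a Hive table.
      static void writeToHive(PCollection<HCatRecord> records) {
        Map<String, String> configProperties = new HashMap<>();
        configProperties.put("hive.metastore.uris", "thrift://<metastore-host>:9083"); // placeholder

        records.apply(HCatalogIO.write()
            .withConfigProperties(configProperties)
            .withDatabase("my_db")     // illustrative database name
            .withTable("my_table"));   // illustrative table name
      }
    }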
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Tue, Feb 11, 2020 at 11:38 AM Jean-Baptiste Onofre <j...@nanthrax.net> wrote:
> > >
> > > Hi,
> > >
> > >
> > >
> > > If you are OK with using Hive via JDBC, you can use JdbcIO and see the example in the javadoc:
> > >
> > >
> > >
> > > https://github.com/apache/beam/blob/master/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L82
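
And if the JDBC route is ever needed, a hedged sketch against HiveServer2 could look like this. The driver class is the standard org.apache.hive.jdbc.HiveDriver, but the connection URL, query, and row mapping are illustrative, and the hive-jdbc driver must be on the classpath.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.values.PCollection;

    public class HiveJdbcReadSketch {
      // Sketch only: read rows from Hive through HiveServer2 with JdbcIO.
      static PCollection<String> readRows(Pipeline pipeline) {
        return pipeline.apply(JdbcIO.<String>read()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "org.apache.hive.jdbc.HiveDriver",
                "jdbc:hive2://<hiveserver2-host>:10000/my_db")) // placeholder URL
            .withQuery("SELECT * FROM my_table")                // illustrative query
            .withCoder(StringUtf8Coder.of())
            .withRowMapper(resultSet -> resultSet.getString(1))); // first column as text
      }
    }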
> > >
> > >
> > >
> > > Regards
> > >
> > > JB
> > >
> > >
> > >
> > > On 11 Feb 2020, at 07:03, Gershi, Noam <noam.ger...@citi.com> wrote:
> > >
> > >
> > >
> > > Hi,
> > >
> > >
> > >
> > > I am searching for a detailed example of how to use Apache Beam with Hive and/or HCatalog:
> > >
> > > creating tables, inserting data, and fetching it…
> > >
> > >
> > >
> > >
> > >
> > > Noam Gershi | Software Developer
> > > ICG Technology – TLV Lab
> > > T: +972 (3) 7405718
> > >
> > >
> > >
> > >
> > >
> >
> >
> >
> > --
> > Regards,
> > Tomo
>
>
>
> --
> Regards,
> Tomo
>
