I see the Beam dependency 2.15 is a bit outdated. Would you try 2.19.0?

On Thu, Feb 13, 2020 at 3:52 AM Gershi, Noam <noam.ger...@citi.com> wrote:
>
> Hi
> Pom attached.
>
> -----Original Message-----
> From: Tomo Suzuki <suzt...@google.com>
> Sent: Wednesday, February 12, 2020 4:56 PM
> To: user@beam.apache.org
> Subject: Re: Apache Beam with Hive
>
> Hi Noam,
>
> It seems incompatible libraries are in your dependencies.
> Would you share the pom.xml (or build.gradle) that produced the NoSuchMethodError?
>
> On Wed, Feb 12, 2020 at 12:47 AM Gershi, Noam <noam.ger...@citi.com> wrote:
> >
> > I used HCatalogIO:
> >
> > Map<String, String> configProperties = new HashMap<>();
> > configProperties.put("hive.metastore.uris", "....");
> >
> > pipeline.apply(HCatalogIO.read()
> >         .withConfigProperties(configProperties)
> >         .withDatabase(DB_NAME)
> >         .withTable(TEST_TABLE))
> >     .apply("to-string", MapElements.via(new SimpleFunction<HCatRecord, String>() {
> >         @Override
> >         public String apply(HCatRecord input) {
> >             return input.toString();
> >         }
> >     }))
> >     .apply(TextIO.write().to("my-logs/output.txt").withoutSharding());
> >
> > But I am getting this error:
> >
> > 20/02/11 07:49:24 INFO spark.SparkContext: Starting job: collect at BoundedDataset.java:93
> > Exception in thread "dag-scheduler-event-loop" java.lang.NoSuchMethodError:
> > org.apache.hive.hcatalog.common.HCatUtil.getHiveMetastoreClient(Lorg/apache/hadoop/hive/conf/HiveConf;)Lorg/apache/hadoop/hive/metastore/IMetaStoreClient;
> >     at org.apache.beam.sdk.io.hcatalog.HCatalogUtils.createMetaStoreClient(HCatalogUtils.java:42)
> >     at org.apache.beam.sdk.io.hcatalog.HCatalogIO$BoundedHCatalogSource.getEstimatedSizeBytes(HCatalogIO.java:323)
> >     at org.apache.beam.runners.spark.io.SourceRDD$Bounded.getPartitions(SourceRDD.java:104)
> >     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
> >     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
> >     at scala.Option.getOrElse(Option.scala:121)
> >     at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
> >     at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> >     (the previous five frames repeat for each nested MapPartitionsRDD)
> >     at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:94)
> >     at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:87)
> >     at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:240)
> >     at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:238)
> >     at scala.Option.getOrElse(Option.scala:121)
> >     at org.apache.spark.rdd.RDD.dependencies(RDD.scala:238)
> >     at org.apache.spark.scheduler.DAGScheduler.getShuffleDependencies(DAGScheduler.scala:512)
> >     at org.apache.spark.scheduler.DAGScheduler.getOrCreateParentStages(DAGScheduler.scala:461)
> >     at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:448)
> >     at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:962)
> >     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2067)
> >     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
> >     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
> >     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> > 20/02/11 07:52:54 INFO util.JsonImpl: Shutting down HTTP client
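
A note on the error above: a NoSuchMethodError on HCatUtil.getHiveMetastoreClient almost always means the hive-* jars on the runtime classpath are a different version than the ones beam-sdks-java-io-hcatalog was compiled against. Below is a minimal sketch of the relevant pom.xml section; 2.19.0 follows the suggestion at the top of the thread, while the Hive version shown is only a placeholder (not taken from the attached pom) and should match the Hive your cluster actually runs. Running `mvn dependency:tree` helps spot where a conflicting hive-* jar sneaks in.

    <!-- Sketch only: versions are assumptions, not taken from the attached pom. -->
    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-sdks-java-io-hcatalog</artifactId>
      <version>2.19.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hive.hcatalog</groupId>
      <artifactId>hive-hcatalog-core</artifactId>
      <!-- Placeholder: must match the Hive version on your cluster. -->
      <version>2.3.6</version>
    </dependency>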
> >
> > From: Gershi, Noam [ICG-IT]
> > Sent: Tuesday, February 11, 2020 11:08 AM
> > To: user@beam.apache.org
> > Subject: RE: Apache Beam with Hive
> >
> > Thanks.
> > Looks like HCatalog could work.
> > But - is there an example with a 'SELECT' query?
> >
> > JdbcIO is probably not good for me, since my Spark cluster is already
> > configured to work with Hive. So when I code Spark pipelines, I can write
> > queries without needing to supply a user/password. I would like to have
> > something similar in Apache Beam.
> >
> > From: utkarsh.kh...@gmail.com <utkarsh.kh...@gmail.com>
> > Sent: Tuesday, February 11, 2020 10:26 AM
> > To: user@beam.apache.org
> > Subject: Re: Apache Beam with Hive
> >
> > While I've not created Hive tables using this, reading and writing worked
> > very well using HCatalogIO.
> >
> > https://beam.apache.org/releases/javadoc/2.18.0/org/apache/beam/sdk/io/hcatalog/HCatalogIO.html
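
On the 'SELECT' question above: HCatalogIO reads a table rather than running a query, so row-level selection is typically done with a transform in the pipeline. Below is a minimal, untested sketch along those lines; the metastore URI, database, table, column position, and filter value are all placeholders, not taken from the thread. HCatalogIO.Read also exposes a withFilter(...) option for partition-level filtering, which is worth checking in the javadoc linked above.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.hcatalog.HCatalogIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Filter;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.SimpleFunction;
    import org.apache.hive.hcatalog.data.HCatRecord;

    public class HCatalogSelectLike {
      public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Placeholder URI; use your cluster's hive.metastore.uris value.
        Map<String, String> configProperties = new HashMap<>();
        configProperties.put("hive.metastore.uris", "thrift://metastore-host:9083");

        pipeline
            .apply(HCatalogIO.read()
                .withConfigProperties(configProperties)
                .withDatabase("mydb")       // placeholder database
                .withTable("test_table"))   // placeholder table
            // Emulates "SELECT * ... WHERE <column 0> = 'ACTIVE'" by filtering
            // rows in the pipeline; column position and value are made up.
            .apply("where-clause", Filter.by((HCatRecord r) -> "ACTIVE".equals(r.get(0))))
            .apply("to-string", MapElements.via(new SimpleFunction<HCatRecord, String>() {
              @Override
              public String apply(HCatRecord input) {
                return input.toString();
              }
            }))
            .apply(TextIO.write().to("my-logs/output.txt").withoutSharding());

        pipeline.run().waitUntilFinish();
      }
    }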
> >
> > On Tue, Feb 11, 2020 at 11:38 AM Jean-Baptiste Onofre <j...@nanthrax.net> wrote:
> >
> > Hi,
> >
> > If you are OK to use Hive via JDBC, you can use JdbcIO and see the example
> > in the javadoc:
> >
> > https://github.com/apache/beam/blob/master/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L82
> >
> > Regards
> > JB
> >
> > On 11 Feb 2020 at 07:03, Gershi, Noam <noam.ger...@citi.com> wrote:
> >
> > Hi,
> >
> > I am searching for a detailed example of how to use Apache Beam with Hive
> > and/or HCatalog: creating tables, inserting data and fetching it…
> >
> > Noam Gershi, Software Developer
> > ICG Technology – TLV Lab
> > T: +972 (3) 7405718
>
> --
> Regards,
> Tomo

--
Regards,
Tomo
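
For completeness on the JdbcIO route JB suggested: a minimal, untested sketch using the HiveServer2 JDBC driver follows. The URL, database, and query are placeholders; it needs beam-sdks-java-io-jdbc and hive-jdbc on the classpath, and a kerberized cluster (as described above, where no user/password is supplied) would use a principal in the JDBC URL rather than withUsername/withPassword.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class HiveJdbcRead {
      public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        pipeline
            .apply(JdbcIO.<String>read()
                .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                    "org.apache.hive.jdbc.HiveDriver",        // HiveServer2 JDBC driver
                    "jdbc:hive2://hive-server:10000/mydb"))   // placeholder URL
                .withQuery("SELECT name FROM test_table")     // placeholder query
                .withRowMapper(rs -> rs.getString(1))         // first column as String
                .withCoder(StringUtf8Coder.of()))
            .apply(TextIO.write().to("my-logs/jdbc-output.txt").withoutSharding());

        pipeline.run().waitUntilFinish();
      }
    }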