Re: HCatalogIO - Trying to read table metadata (columns names and indexes)

2020-04-28 Thread rahul patwari
Hi Noam, Currently, Beam doesn't support converting HCatRecords to Rows (or, in your case, creating a Beam Schema from the Hive table schema) when the Hive table has parameterized types. We can use HCatFieldSchema[1] to create the Beam Schema from the Hive table schema. I have created a JIRA ticket …
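For context, a minimal sketch of the kind of HCatFieldSchema-to-Beam-Schema mapping the reply refers to. The type cases shown are illustrative, and parameterized types such as DECIMAL(10,2) or VARCHAR(n) are exactly the ones that fall through to the default branch:

```java
import java.util.List;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.hive.hcatalog.data.schema.HCatFieldSchema;

// Illustrative sketch: build a Beam Schema from a Hive table's
// HCatFieldSchemas, covering a few primitive types only.
static Schema toBeamSchema(List<HCatFieldSchema> hiveFields) {
  Schema.Builder builder = Schema.builder();
  for (HCatFieldSchema field : hiveFields) {
    switch (field.getType()) {
      case INT:
        builder.addInt32Field(field.getName());
        break;
      case BIGINT:
        builder.addInt64Field(field.getName());
        break;
      case DOUBLE:
        builder.addDoubleField(field.getName());
        break;
      case STRING:
        builder.addStringField(field.getName());
        break;
      default:
        // Parameterized types (DECIMAL(p,s), VARCHAR(n), CHAR(n), ...)
        // land here: the gap described in this thread.
        throw new UnsupportedOperationException(
            "Cannot map Hive type: " + field.getTypeString());
    }
  }
  return builder.build();
}
```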

Re: SparkRunner on k8s

2020-04-10 Thread rahul patwari
Hi Buvana, You can submit a Beam Pipeline to Spark on k8s like any other Spark Pipeline using the spark-submit script. Create an Uber Jar of your Beam code and provide it as the primary resource to spark-submit. Provide the k8s master and the container image to use as arguments to spark-submit.
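To make that concrete, a spark-submit invocation of roughly that shape; the cluster URL, container image, main class, and jar path are all placeholders, not values from the thread:

```
spark-submit \
  --master k8s://https://<k8s-apiserver>:<port> \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<spark-image-with-your-deps> \
  --class com.example.MyBeamPipeline \
  /path/to/my-beam-pipeline-bundled.jar \
  --runner=SparkRunner
```

Arguments after the jar (here `--runner=SparkRunner`) are passed to the Beam pipeline's main method rather than to spark-submit itself.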

Latency in advancing Spark Watermarks

2020-03-19 Thread rahul patwari
Hi, *Usage Info*: We are using Beam 2.16.0 and Spark 2.4.2. We are running Spark on Kubernetes, using the Spark Streaming (legacy) runner with the Beam Java SDK. The pipeline has been run with default configurations, i.e. the default values for SparkPipelineOptions. *Issue*: When a Beam pipeline …
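For reference, a hedged sketch of overriding SparkPipelineOptions rather than relying on the defaults mentioned here; the option names come from the Spark runner's options interface, and the values are purely illustrative:

```java
import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class Example {
  public static void main(String[] args) {
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);
    // Micro-batch interval of the legacy Spark Streaming runner (illustrative value).
    options.setBatchIntervalMillis(1000L);
    // Checkpoint location (hypothetical path).
    options.setCheckpointDir("/tmp/beam-spark-checkpoint");
    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline ...
    pipeline.run().waitUntilFinish();
  }
}
```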

Re: Unbounded input join Unbounded input then write to Bounded Sink

2020-02-24 Thread rahul patwari
Hi Kenn, Rui, The pipeline that we are trying is exactly what Kenn has mentioned above, i.e. Read From Kafka => Apply Fixed Windows of 1 Min => SqlTransform => Write to Hive using HCatalogIO. We are interested in understanding the behaviour when the source is unbounded and the sink is bounded, as this …
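A sketch of that pipeline shape, with placeholder broker, topic, schema, query, and table names; the two conversion DoFns are hypothetical stubs (HCatalogIO writes HCatRecords, so the Rows coming out of SqlTransform have to be converted):

```java
// Sketch only: names, schema, and the conversion DoFns are hypothetical.
PCollection<Row> rows =
    p.apply(KafkaIO.<String, String>read()
            .withBootstrapServers("broker:9092")
            .withTopic("events")
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())
        .apply(Window.<KV<String, String>>into(
            FixedWindows.of(Duration.standardMinutes(1))))
        .apply(ParDo.of(new JsonToRowFn()))   // hypothetical parser
        .setRowSchema(EVENT_SCHEMA);          // hypothetical Beam schema

rows.apply(SqlTransform.query(
        "SELECT field1, COUNT(*) AS cnt FROM PCOLLECTION GROUP BY field1"))
    .apply(ParDo.of(new RowToHCatRecordFn())) // hypothetical converter
    .apply(HCatalogIO.write()
        .withConfigProperties(hiveConfig)     // e.g. hive.metastore.uris
        .withDatabase("default")
        .withTable("events_agg"));
```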

Re: Kafka Avro Schema Registry Support

2020-02-04 Thread rahul patwari
Hi Raghu, The deserializer is provided by the Confluent *io.confluent.kafka.serializers* package. When we set valueDeserializer as KafkaAvroDeserializer, we are …
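For readers landing on this thread, a hedged sketch of the workaround usually discussed in this era: passing KafkaAvroDeserializer with an explicit coder and pointing the consumer at the registry. Broker, topic, URL, and schema are placeholders, and method names vary across Beam versions (older releases used updateConsumerProperties instead of withConsumerConfigUpdates):

```java
Map<String, Object> consumerConfig = new HashMap<>();
consumerConfig.put("schema.registry.url", "http://schema-registry:8081");

p.apply(KafkaIO.<String, GenericRecord>read()
    .withBootstrapServers("broker:9092")
    .withTopic("avro-topic")
    .withKeyDeserializer(StringDeserializer.class)
    // KafkaAvroDeserializer implements Deserializer<Object>, hence the
    // raw cast and the explicit coder for GenericRecord.
    .withValueDeserializerAndCoder(
        (Class) KafkaAvroDeserializer.class,
        AvroCoder.of(GenericRecord.class, AVRO_READER_SCHEMA)) // your Avro schema
    .withConsumerConfigUpdates(consumerConfig));
```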

Re: Multiple triggers contained w/in side input?

2019-11-06 Thread rahul patwari
Hi Kenn, Does the side input have elements from the previous trigger even when used with .discardingFiredPanes(), as in https://beam.apache.org/documentation/patterns/side-inputs/#slowly-updating-global-window-side-inputs? Does View.asSingleton() affect this behaviour? Thanks, Rahul On Wed, Nov …

Re: Joining PCollections to aggregates of themselves

2019-10-10 Thread rahul patwari
… in a desired order, e.g. sorted by event time? Most aggregations we're likely to be running are per-key rather than global, so the parallelism issue might not be such a big deal. Regards, Sam On Thu, Oct 10, 2019 at 5:30 PM rahul patwari wrote: …

Re: Joining PCollections to aggregates of themselves

2019-10-10 Thread rahul patwari
Hi Sam, (Assuming all the tuples have the same key.) One solution could be to use a ParDo with state (to calculate the mean): for each element as it occurs, calculate the mean (storing the sum and count as state) and emit the tuple with the new average value. But it will limit the parallelism.
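A minimal sketch of that stateful running mean, assuming KV<String, Double> input; the state ids, coders, and class name are illustrative:

```java
import org.apache.beam.sdk.coders.DoubleCoder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class RunningMeanFn extends DoFn<KV<String, Double>, KV<String, Double>> {

  @StateId("sum")
  private final StateSpec<ValueState<Double>> sumSpec =
      StateSpecs.value(DoubleCoder.of());

  @StateId("count")
  private final StateSpec<ValueState<Long>> countSpec =
      StateSpecs.value(VarLongCoder.of());

  @ProcessElement
  public void process(
      ProcessContext c,
      @StateId("sum") ValueState<Double> sum,
      @StateId("count") ValueState<Long> count) {
    Double s = sum.read();
    Long n = count.read();
    double newSum = (s == null ? 0.0 : s) + c.element().getValue();
    long newCount = (n == null ? 0L : n) + 1;
    sum.write(newSum);
    count.write(newCount);
    // Emit the element's key with the running mean so far.
    c.output(KV.of(c.element().getKey(), newSum / newCount));
  }
}
```

Since state is kept per key (and window), a single shared key serializes processing onto one worker, which is the parallelism limit mentioned above.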

Re: Watermark is lagging in Spark Runner with kafkaIO

2019-09-10 Thread rahul patwari
Forgot to mention: a FixedWindow of duration 1 minute is applied before applying SqlTransform. On Tue, Sep 10, 2019 at 6:03 PM rahul patwari wrote: Hi, I am facing this issue too. +dev Here is the pipeline that we are using (providing a very simple pipeline to h…

Re: Watermark is lagging in Spark Runner with kafkaIO

2019-09-10 Thread rahul patwari
Hi, I am facing this issue too. +dev Here is the pipeline that we are using (providing a very simple pipeline to highlight the issue): KafkaSource -> SqlTransform -> KafkaSink. We are reading from a single topic in KafkaSource with a single partition. Here is the data that we are producing to …

Re: Stateful ParDo on Non-Keyed PCollection

2019-07-25 Thread rahul patwari
… [per key, per window] processing. On Thu, Jul 25, 2019 at 10:08 PM Reuven Lax wrote: Have you looked at the GroupIntoBatches transform? On Thu, Jul 25, 2019 at 9:34 AM rahul patwari wrote: So, if an RPC call has to be performed for a batch of Rows (PC…
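A sketch of the GroupIntoBatches suggestion; the batch size and the RPC client are hypothetical:

```java
keyedRows // PCollection<KV<String, Row>>
    .apply(GroupIntoBatches.<String, Row>ofSize(100))
    .apply(ParDo.of(new DoFn<KV<String, Iterable<Row>>, Void>() {
      @ProcessElement
      public void process(@Element KV<String, Iterable<Row>> batch) {
        // One RPC per batch of up to 100 rows (hypothetical client).
        rpcClient.send(batch.getValue());
      }
    }));
```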

Re: Stateful ParDo on Non-Keyed PCollection

2019-07-25 Thread rahul patwari
… then apply a stateful DoFn, though in that case all elements would get processed on the same worker.) On Thu, Jul 25, 2019 at 6:06 PM rahul patwari wrote: Hi, https://beam.apache.org/blog/2017/02/13/stateful-processing.html gives …

Stateful ParDo on Non-Keyed PCollection

2019-07-25 Thread rahul patwari
Hi, https://beam.apache.org/blog/2017/02/13/stateful-processing.html gives an example of assigning an arbitrary-but-consistent index to each element on a per key-and-window basis. If the stateful ParDo is applied on a non-keyed PCollection, say, a PCollection with fixed windows, the state is …
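Since state is scoped per key and window, the usual answer (as the replies above note) is to key the PCollection first. A sketch, where the keying strategy is the trade-off and the stateful DoFn is hypothetical:

```java
PCollection<Row> rows = ...; // non-keyed input with fixed windows

// A constant key routes everything to one worker (simple, no parallelism);
// a hash-based key keeps parallelism, with state kept per key and window.
PCollection<KV<Integer, Row>> keyed =
    rows.apply(WithKeys.of(0)); // or: WithKeys.of(row -> row.hashCode() % 10)
                                //     .withKeyType(TypeDescriptors.integers())

keyed.apply(ParDo.of(new MyIndexingStatefulFn())); // hypothetical stateful DoFn
```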

Re: Slowly changing lookup cache as a Table in BeamSql

2019-07-16 Thread rahul patwari
… (code example) https://beam.apache.org/documentation/patterns/side-input-patterns/. Please note that in the DoFn that feeds the View.asSingleton() you will need to manually call BigQuery using the BigQuery client. Regards …
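The pattern linked above, in outline; the refresh interval, element types, and the lookup call are placeholders:

```java
PCollectionView<Map<String, String>> lookupView =
    p.apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(5)))
        .apply(ParDo.of(new DoFn<Long, Map<String, String>>() {
          @ProcessElement
          public void process(ProcessContext c) {
            // Hypothetical lookup, e.g. a manual read via the BigQuery client.
            c.output(fetchLookupTable());
          }
        }))
        .apply(Window.<Map<String, String>>into(new GlobalWindows())
            .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane()))
            .discardingFiredPanes()
            .withAllowedLateness(Duration.ZERO))
        .apply(View.asSingleton());
```

Each firing replaces the singleton value seen by consumers of the side input, which is what makes the view "slowly updating".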

Custom Watermark Instance being created multiple times for KafkaIO

2019-05-21 Thread rahul patwari
Hi, We are using withTimestampPolicyFactory (TimestampPolicyFactory …
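For readers of the archive, a sketch of what such a factory can look like; the event-time extractor is hypothetical. Note that KafkaIO asks the factory for one TimestampPolicy per topic partition (and again after restarts), which is one reason the policy object can be constructed multiple times:

```java
import java.util.Optional;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.io.kafka.TimestampPolicy;
import org.apache.beam.sdk.io.kafka.TimestampPolicyFactory;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.kafka.common.TopicPartition;
import org.joda.time.Instant;

TimestampPolicyFactory<String, String> factory =
    (TopicPartition tp, Optional<Instant> previousWatermark) ->
        new TimestampPolicy<String, String>() {
          private Instant maxEventTime =
              previousWatermark.orElse(BoundedWindow.TIMESTAMP_MIN_VALUE);

          @Override
          public Instant getTimestampForRecord(
              PartitionContext ctx, KafkaRecord<String, String> record) {
            Instant ts = extractEventTime(record); // hypothetical extractor
            if (ts.isAfter(maxEventTime)) {
              maxEventTime = ts;
            }
            return ts;
          }

          @Override
          public Instant getWatermark(PartitionContext ctx) {
            return maxEventTime;
          }
        };

// Wired in via KafkaIO.<String, String>read().withTimestampPolicyFactory(factory).
```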

Re: NullPointerException - Session windows with Lateness in FlinkRunner

2019-03-27 Thread rahul patwari
+dev On Wed 27 Mar, 2019, 9:47 PM rahul patwari wrote: Hi, I am using Beam 2.11.0, Runner - beam-runners-flink-1.7, Flink Cluster - 1.7.2. I have this flow in my pipeline: KafkaSource(withCreateTime()) --> ApplyWindow(SessionWindow with gapDuration=1 Mi…

NullPointerException - Session windows with Lateness in FlinkRunner

2019-03-27 Thread rahul patwari
Hi, I am using Beam 2.11.0, Runner - beam-runners-flink-1.7, Flink Cluster - 1.7.2. I have this flow in my pipeline: KafkaSource(withCreateTime()) --> ApplyWindow(SessionWindow with gapDuration=1 Minute, lateness=3 Minutes, AccumulatingFiredPanes, default trigger) --> BeamSQL(GroupBy query) …
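The windowing step of that flow, as a sketch (the element type and the surrounding Kafka and BeamSQL stages are omitted):

```java
input.apply(
    Window.<Row>into(Sessions.withGapDuration(Duration.standardMinutes(1)))
        .withAllowedLateness(Duration.standardMinutes(3))
        .accumulatingFiredPanes());
// Default trigger, as in the pipeline above; the windowed result then
// feeds the BeamSQL GroupBy stage.
```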

Re: joda-time dependency version

2019-03-21 Thread rahul patwari
… in public API will remain *backwards compatible* for both source and binary in the 2.x stream. This means you should be able to safely use Spark's version. D. On Thu, Mar 21, 2019 at 5:45 AM rahul patwari wrote: Hi Ismael, We ar…

Re: joda-time dependency version

2019-03-20 Thread rahul patwari
… push the upgrade? Regards, Ismaël On Wed, Mar 20, 2019 at 11:53 AM rahul patwari wrote: Hi, Is there a plan to upgrade the dependency version of joda-time to 2.9.3 or the latest version? Thanks, Rahul …

joda-time dependency version

2019-03-20 Thread rahul patwari
Hi, Is there a plan to upgrade the dependency version of joda-time to 2.9.3 or the latest version? Thanks, Rahul

WindowTypeDescriptor of output PCollection emitted from GroupByKey

2019-03-12 Thread rahul patwari
Hi, I am exploring sessions windowing in Apache Beam. I have created a pipeline to find the window start time and window end time of the elements emitted from GroupByKey, which groups elements to which the sessions window was applied. I got the exception: Exception in thread "main" …
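One way to observe those bounds without going through the window type descriptor is to read the window directly in a downstream DoFn; after merging, session windows are IntervalWindows. A sketch with illustrative element types:

```java
grouped.apply(ParDo.of(new DoFn<KV<String, Iterable<Row>>, String>() {
  @ProcessElement
  public void process(ProcessContext c, BoundedWindow window) {
    // Sessions produce IntervalWindows after merging.
    IntervalWindow w = (IntervalWindow) window;
    c.output("key=" + c.element().getKey()
        + " window=[" + w.start() + ", " + w.end() + ")");
  }
}));
```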

Re: Kafka LogAppendTimePolicy not throwing Exception when log.message.timestamp.type = CreateTime for the Kafka topic

2019-01-04 Thread rahul patwari
… timestamp is not used to watermark. As you suggested, a simpler fix might just be to require every record's timestamp_type to be LOG_APPEND_TIME (i.e. replace 'else if' with 'else'). Is that safe? We don't want users to get stuck if some topics are expected to have multiple timestam…
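Illustrative only (not the actual KafkaIO source): the stricter per-record check being debated would amount to something like the following, where updateWatermark stands in for the policy's bookkeeping:

```java
if (record.getTimestampType() == KafkaTimestampType.LOG_APPEND_TIME) {
  // Safe to advance the watermark from broker-assigned timestamps.
  updateWatermark(new Instant(record.getTimestamp()));
} else {
  throw new IllegalStateException(
      "LogAppendTimePolicy requires log.message.timestamp.type=LogAppendTime,"
          + " but got " + record.getTimestampType() + " for " + record.getTopic());
}
```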

Kafka Avro Schema Registry Support

2018-09-27 Thread rahul patwari
Hi, We have a use case to read data from Kafka serialized with KafkaAvroSerializer, where the schema is stored in a Schema Registry. When we try to use io.confluent.kafka.serializers.KafkaAvroDeserializer as the ValueDeserializer to get GenericRecord, we are seeing errors. Does KafkaIO.read() …
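Later Beam releases (well after this 2018 question) added a deserializer provider backed by the Confluent schema registry, which is the cleaner route today; a sketch with placeholder broker, URL, and subject:

```java
p.apply(KafkaIO.<String, GenericRecord>read()
    .withBootstrapServers("broker:9092")
    .withTopic("avro-topic")
    .withKeyDeserializer(StringDeserializer.class)
    .withValueDeserializer(
        ConfluentSchemaRegistryDeserializerProvider.of(
            "http://schema-registry:8081", "avro-topic-value")));
```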

Lateness for Spark

2018-09-07 Thread rahul patwari
Hi, We are running a Beam program on Spark, using Beam 2.5.0 and the matching SparkRunner. We are seeing late data in the output emitted by Spark. As per the Capability Matrix, lateness is not supported in Spark. Is it supported now, or are we missing something? Steps: Read from Kafka, apply …