ShivaKumar SS created BAHIR-242:
-----------------------------------
Summary: Support for more params and flexibilities in spark | google pubsub
Key: BAHIR-242
URL: https://issues.apache.org/jira/browse/BAHIR-242
Project: Bahir
Issue Type: Improvement
Components: Spark Streaming Connectors, Spark Structured Streaming Connectors
Affects Versions: Spark-2.3.0
Reporter: ShivaKumar SS
Hi All,
I am using Google Pub/Sub together with Spark Streaming.
My requirement is as follows:
1. There are multiple publishers pushing messages, and I expect the topic to
receive 7k - 10k messages per second.
2. On the subscriber side, I have a Spark Streaming job running on a high-memory
cluster with one worker and one executor. In the streaming app, the batch
interval is 10 seconds; every 10 seconds I need to pull all the data from the
topic and write it to a file.
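To make the volume concrete, here is a rough back-of-the-envelope sizing; the figures are taken from the scenario above (upper bound of the stated publish rate) and are assumptions about my workload, not measured values:

```scala
// Rough sizing sketch for the scenario above; the rates are assumed,
// not measured.
object BatchSizing extends App {
  val messagesPerSecond = 10000    // upper bound of the stated publish rate
  val batchIntervalSeconds = 10    // Spark batch interval
  val messagesPerBatch = messagesPerSecond * batchIntervalSeconds
  // So each 10-second batch has to drain up to 100000 messages
  // from the subscription.
  println(s"messages per batch: $messagesPerBatch")
}
```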
With this requirement I started with the code and assumed everything would work
fine, but unfortunately it didn't, due to the following issues:
1. There is a hardcoded parameter controlling how many messages are pulled per
request; it is currently fixed at 1000:
https://github.com/apache/bahir/blob/master/streaming-pubsub/src/main/scala/org/apache/spark/streaming/pubsub/PubsubInputDStream.scala#L234
Because of this I cannot pull more than 1000 messages at a time, and with my
requirement I cannot simply increase my executors.
2. Various other configuration options available in Google's Pub/Sub APIs
(https://github.com/googleapis/java-pubsub) are completely missing here, such as
manual ack, increasing the ack deadline, async ack, etc.
3. Support for the latest Spark 2.x and 3.x versions.
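To make points 1 and 2 concrete, here is a rough sketch of what a configurable options holder for the receiver could look like. Everything here is hypothetical: the class name and the parameters (maxMessagesPerPull, ackDeadlineSeconds, autoAcknowledge, asyncAck) do not exist in the current Bahir API; the idea is only to show the knobs I am asking for.

```scala
// Hypothetical sketch only: none of these names exist in Bahir today.
// The intent is to replace the hardcoded maxMessages = 1000 in
// PubsubInputDStream with a user-supplied value and to surface the
// ack-related settings offered by the Google Pub/Sub client.
case class PubsubReceiverOptions(
    maxMessagesPerPull: Int = 1000,   // today hardcoded at 1000
    ackDeadlineSeconds: Int = 10,     // allow extending the ack deadline
    autoAcknowledge: Boolean = true,  // false would enable manual ack
    asyncAck: Boolean = false)        // acknowledge off the pull thread

object PubsubReceiverOptions {
  // Example: settings tuned for the 7k-10k msg/s scenario above.
  val highThroughput: PubsubReceiverOptions =
    PubsubReceiverOptions(maxMessagesPerPull = 10000, ackDeadlineSeconds = 60)
}
```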
Is there any plan to develop these?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)