ShivaKumar SS created BAHIR-242:
-----------------------------------

             Summary: Support for more params and flexibilities in spark | google pubsub
                 Key: BAHIR-242
                 URL: https://issues.apache.org/jira/browse/BAHIR-242
             Project: Bahir
          Issue Type: Improvement
          Components: Spark Streaming Connectors, Spark Structured Streaming Connectors
    Affects Versions: Spark-2.3.0
            Reporter: ShivaKumar SS


Hi All,

I am using Google Pub/Sub together with Spark Streaming.

My requirements are as follows:

1. There are multiple publishers pushing messages, and I expect the topic to receive 7k - 10k messages per second.

2. On the subscriber side, I have a Spark Streaming job running on a high-memory cluster with one worker and one executor. The batch interval is 10 seconds; every 10 seconds I need to pull all the available data from the topic and write it to a file.
   
With these requirements I started writing the code and assumed everything would work fine, but unfortunately it didn't, due to the following issues:

1. The number of messages pulled per request is a hardcoded parameter, currently fixed at 1000:
   
https://github.com/apache/bahir/blob/master/streaming-pubsub/src/main/scala/org/apache/spark/streaming/pubsub/PubsubInputDStream.scala#L234
 
   Because of this, I cannot pull more than 1000 messages per batch, and given my requirement I cannot simply increase the number of executors.
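As a sketch of the kind of change being requested, the hardcoded limit could be lifted into a parameter whose default preserves the current behaviour. The names below (`PubsubBatchConfig`, `pullRequestBody`) are purely illustrative and are not part of the actual Bahir connector API:

```scala
// Illustrative sketch only: class and method names are hypothetical,
// not the real Bahir connector API.
case class PubsubBatchConfig(maxMessages: Int = 1000) {
  // Builds the body of a Pub/Sub pull request; the default keeps the
  // connector's current behaviour while letting callers raise the limit.
  def pullRequestBody: Map[String, Any] =
    Map("returnImmediately" -> false, "maxMessages" -> maxMessages)
}
```

A caller needing larger batches could then construct the receiver with, say, `PubsubBatchConfig(maxMessages = 10000)`.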

2. There are various other configurations available in Google's Pub/Sub APIs (https://github.com/googleapis/java-pubsub) that are completely missing here, such as manual ack, a configurable ack deadline, and async ack.
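One way the requested settings might be exposed is as an options bundle threaded through stream creation. Again, this is only a sketch of the idea; none of these fields exist in the current Bahir connector:

```scala
// Hypothetical options bundle illustrating the knobs requested above;
// field names and defaults are assumptions, not an existing API.
case class PubsubReceiverOptions(
  maxMessages: Int = 1000,      // messages per pull request
  ackDeadlineSeconds: Int = 10, // how long Pub/Sub waits before redelivery
  autoAck: Boolean = true,      // false = caller acknowledges manually
  asyncAck: Boolean = false     // acknowledge without blocking the pull loop
)
```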

3. Support for the latest Spark 2.x and 3.x versions.

Is there any plan to develop these features?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
