Have you tried controlling the number of partitions of the dataframe? Say
you have 5 partitions; that means you are making 5 concurrent calls to the
web service. The throughput of the web service would become your bottleneck
and the Spark workers would sit idle waiting for tasks, but if you can't
control the REST service, maybe it's worth a shot.
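To make the idea concrete, here is a minimal sketch in pure Python. It pairs the partition-count trick (coalesce to N partitions so at most N concurrent callers exist) with a simple per-partition limiter, so total throughput is roughly rate * number_of_partitions. The `RateLimiter` class and `call_service` helper are hypothetical names, not part of any Spark API:

```python
import time

class RateLimiter:
    """Spaces out calls so at most `rate` calls per second are made."""
    def __init__(self, rate):
        self.min_interval = 1.0 / rate
        self.last_call = 0.0

    def acquire(self):
        # Sleep just long enough to keep the minimum interval between calls.
        now = time.monotonic()
        wait = self.last_call + self.min_interval - now
        if wait > 0:
            time.sleep(wait)
        self.last_call = time.monotonic()

def call_service(rows, rate=20):
    """Hypothetical mapPartitions function: one limiter per partition/task."""
    limiter = RateLimiter(rate)
    for row in rows:
        limiter.acquire()
        yield row  # replace with the actual REST call for this row

# In Spark this would be applied per partition, e.g. (sketch, not run here):
# df.coalesce(5).rdd.mapPartitions(call_service)
```

With 5 partitions and rate=20, the service would see about 100 requests per second in total, which matches the kind of cap asked about below.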

Thanks,
Sonal
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>




On Wed, Jan 9, 2019 at 4:51 AM <em...@yeikel.com> wrote:

> I have a data frame to which I apply a UDF that calls a REST web
> service.  This web service is deployed on only a few nodes and it won’t
> be able to handle a massive load from Spark.
>
>
>
> Is it possible to rate limit this UDF? For example, something like 100
> op/s.
>
>
>
> If not, what are the options? Is splitting the df an option?
>
>
>
> I’ve read a similar question on Stack Overflow [1] and the solution
> suggests Spark Streaming, but my application does not involve streaming.
> Do I need to turn the operations into a streaming workflow to achieve
> something like that?
>
>
>
> Current Workflow : Hive -> Spark ->  Service
>
>
>
> Thank you
>
>
>
> [1]
> https://stackoverflow.com/questions/43953882/how-to-rate-limit-a-spark-map-operation
>
