Hi All,

Need your thoughts/inputs on a custom Data Source for accessing Rest based
services in parallel using Spark.

Many a times for business applications (batch oriented) one has to call a
target Rest service for a high number of times (with different set of
values of parameters/KV pairs).

The example use cases for the same are -

- Getting results/prediction from Machine Learning/NLP systems,
- Accessing utility APIs (like address validation) in bulk for 1000s of
inputs
- Ingesting data from systems who support only parametric data query (say
for time series data),
- Indexing data to Search systems
- Web crawling
- Accessing business applications which do not support bulk download
- others ....

Typically, for these use cases, the number of time the Service is called
(with various parameters/data) can be high. So people use/develop a
parallel processing framework (specific to his/her choice of language) to
call the APIs in parallel. But typically it is hard to make such thing run
in a distributed manner using multiple machines.

I found Spark's distributed programming paradigm can be used in a great way
for this. And was trying to create a custom Data Source for the same. Here
is the link to the repo -
https://github.com/sourav-mazumder/Data-Science-Extensions/tree/master/spark-datasource-rest

The interface goes like this :
- Inputs : REST API endpoint URL, input Data in a Temporary Spark Table -
the name of the table has to be passed, type of method (Get, Post, Put or
Delete), userid/password (for the sites which need authentication),
connection parameters (connection time, read time), parameter to call the
target Rest API only once (useful for services for which you have to pay or
have a daily/hourly limit)
- Output : A DataFrame with Rows of Struct. The Struct will have the output
returned by the target API.

Any thoughts/inputs on this  ?
a) Will this be useful for the applications/use cases you develop ?
b) What you typically use to address this type of needs ?
c) What else should be considered to make this framework more generic
/useful ?

Regards,
Sourav

P.S. I found this resource (https://www.alibabacloud.com/forum/read-474)
where the similar requirement is discussed and a solution is proposed. Not
sure what is the status of the proposal. However, some more things I found
need to be addressed in that proposal -
a) The proposal covers calling the Rest API for one set of key/value
parameter. In the above approach one can call same Rest API multiple times
with different sets of values of the keys.
b) There should be an option where Rest API should be called only once for
a given set of key/value parameters. This is important as many a times one
has to pay for accessing a Rest API and also there may be a limit per
day/hour basis.
c) Does not support calling a Rest service which is based on Post or other
HTTP methods
d) The results in other formats (like xml, csv) cannot be addressed

Reply via email to