Hi All, I need your thoughts/inputs on a custom Data Source for accessing REST-based services in parallel using Spark.
Many times in batch-oriented business applications, one has to call a target REST service a large number of times, each with a different set of parameter key/value pairs. Example use cases include:

- Getting results/predictions from Machine Learning/NLP systems
- Accessing utility APIs (like address validation) in bulk for thousands of inputs
- Ingesting data from systems that support only parametric data queries (say, for time-series data)
- Indexing data to search systems
- Web crawling
- Accessing business applications that do not support bulk download
- others ...

Typically, for these use cases, the number of times the service is called (with various parameters/data) can be high. So people use or develop a parallel processing framework (in their language of choice) to call the APIs in parallel. But it is typically hard to make such a framework run in a distributed manner across multiple machines. I found that Spark's distributed programming paradigm can be used in a great way for this, and I have been trying to create a custom Data Source for it. Here is the link to the repo - https://github.com/sourav-mazumder/Data-Science-Extensions/tree/master/spark-datasource-rest

The interface goes like this:

- Inputs: the REST API endpoint URL; the input data in a temporary Spark table (the name of the table has to be passed); the type of HTTP method (GET, POST, PUT or DELETE); userid/password (for sites that need authentication); connection parameters (connection timeout, read timeout); and a parameter to call the target REST API only once per input (useful for services you have to pay for, or that have a daily/hourly limit).
- Output: a DataFrame with Rows of Struct. The Struct holds the output returned by the target API.

Any thoughts/inputs on this?

a) Will this be useful for the applications/use cases you develop?
b) What do you typically use to address this type of need?
c) What else should be considered to make this framework more generic/useful?

Regards,
Sourav

P.S.
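For context, the core pattern the data source would distribute can be sketched outside Spark as a plain parallel fan-out: one REST call per row of input parameters, with the responses collected as structured records. This is a minimal illustration of the idea, not code from the repo; `call_api` and the parameter names are stand-ins for the real HTTP call.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the real HTTP call (e.g. requests.get(url, params=params)).
# It just echoes the parameters so the sketch is self-contained and runnable.
def call_api(params):
    return {"input": params, "result": "validated:" + params["address"]}

# Each row of the temporary Spark table would supply one set of
# key/value parameters; Spark would distribute these across executors,
# while here a thread pool stands in for that parallelism.
param_rows = [{"address": "10 Main St"}, {"address": "22 Oak Ave"}]

with ThreadPoolExecutor(max_workers=4) as pool:
    records = list(pool.map(call_api, param_rows))

# In the proposed data source, `records` would become a DataFrame
# with one Row of Struct per input parameter set.
```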
I found this resource (https://www.alibabacloud.com/forum/read-474) where a similar requirement is discussed and a solution is proposed. I am not sure what the status of that proposal is. However, I found some more things that need to be addressed in it:

a) The proposal covers calling the REST API for one set of key/value parameters. In the above approach, one can call the same REST API multiple times with different sets of values for the keys.
b) There should be an option to call the REST API only once for a given set of key/value parameters. This is important because one often has to pay for accessing a REST API, and there may also be a per-day/per-hour limit.
c) It does not support calling a REST service based on POST or other HTTP methods.
d) Results in other formats (like XML or CSV) are not addressed.
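Point (b) above - calling the API only once per distinct parameter set - can be sketched as a memoizing wrapper keyed on the parameters. The names here are illustrative, not from either proposal; `fetch` stands in for the paid/rate-limited REST call.

```python
# Memoize REST calls on the parameter set so each distinct set of
# key/value pairs hits the paid/rate-limited API only once.
call_count = 0

def fetch(params):
    """Stand-in for the real REST call; counts actual invocations."""
    global call_count
    call_count += 1
    return {"params": dict(params), "status": "ok"}

cache = {}

def fetch_once(params):
    # Cache keys must be hashable: use a sorted tuple of the pairs,
    # so {"q": "a"} always maps to the same key regardless of order.
    key = tuple(sorted(params.items()))
    if key not in cache:
        cache[key] = fetch(params)
    return cache[key]

rows = [{"q": "a"}, {"q": "b"}, {"q": "a"}]  # note the duplicate
results = [fetch_once(r) for r in rows]
# Three rows of input, but only two actual API calls are made.
```

In the Spark setting, the same effect could be had by deduplicating the input table on the parameter columns before the fan-out and joining the responses back, which also keeps the per-day/per-hour quota usage predictable.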