Processing huge amount of data from paged API

anonymous Sun, 21 Jan 2018 12:33:56 -0800

Hello,

I work at an IoT company, and I have a use case for which I would like to know
if Apache Spark could be helpful. It's a very broad question, and sorry if
it's long-winded.


We have HTTP GET APIs to get two kinds of information:
1) The Device Messages API returns data about device messages (in JSON).
2) The Devices API returns information about devices (in JSON) -- for
example, device name, device owner, etc. Each Device Message has a Device ID
field, which points to the device which sent it.
To make it clearer, we have devices, and each device can send many device
messages.
Our goal is to fetch device messages and send them to an ElasticSearch index.

The two major problems are:
1) We need data to be denormalized. That is, we don't want to have one index
for device messages, and a separate index for device information -- we want
each device message to have the corresponding device's information attached
to it. That is because ElasticSearch works best with denormalized data. So,
we would like a solution that can join (as in an SQL join) the Device
Message data with the Device data and apply some transformations to the
result before sending it to ElasticSearch. We can potentially have millions of
devices and device messages, so this solution needs to be scalable.
2) Both the Device Messages API and the Devices API are paged, and each can
potentially have thousands of pages. Requesting thousands of pages one after
another over HTTP is slow, so it would be good to have a way to parallelize
this process.
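
To make problem 1 concrete, here is a minimal sketch in plain Python (not
Spark) of the denormalization described above: each device message gets the
matching device record attached before it would be indexed. The field names
(device_id, name, owner, payload) and the sample records are hypothetical and
only illustrate the shape of the join, not the real API payloads.

```python
# Hypothetical sample records; the field names are assumptions for illustration.
devices = [
    {"device_id": "d1", "name": "thermostat-1", "owner": "alice"},
    {"device_id": "d2", "name": "sensor-7", "owner": "bob"},
]
messages = [
    {"message_id": "m1", "device_id": "d1", "payload": "temp=21"},
    {"message_id": "m2", "device_id": "d2", "payload": "humidity=40"},
    {"message_id": "m3", "device_id": "d1", "payload": "temp=22"},
]


def denormalize(messages, devices):
    """Attach each message's device record to it (an inner join on device_id)."""
    # Index devices by ID so each message lookup is a constant-time dict access.
    by_id = {d["device_id"]: d for d in devices}
    joined = []
    for msg in messages:
        device = by_id.get(msg["device_id"])
        if device is None:
            continue  # inner-join semantics: skip messages from unknown devices
        joined.append({**msg, "device": device})
    return joined


docs = denormalize(messages, devices)
```

This single-machine version only works while the device table fits in memory;
the question is whether a distributed join can do the same at the scale of
millions of records.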

In short, we would like a solution that can help with:
1) Joining and transforming large amounts of data (from a paged API) before
sending it to ElasticSearch.
2) Making the process of sifting through all the pages in the paged APIs
more efficient.
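
As a rough illustration of point 2, the sketch below fetches pages
concurrently with a thread pool from the Python standard library. It assumes
the total number of pages is known up front and that pages can be requested
independently by page number; neither is confirmed for the APIs above.
fetch_page is a hypothetical stand-in that fabricates records in memory
instead of issuing a real HTTP GET.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the real paged API: 10 pages of 3 fake messages each (assumed sizes).
TOTAL_PAGES = 10
PAGE_SIZE = 3


def fetch_page(page):
    """Hypothetical page fetcher; a real one would issue an HTTP GET for this page."""
    start = page * PAGE_SIZE
    return [{"message_id": f"m{i}", "page": page} for i in range(start, start + PAGE_SIZE)]


def fetch_all(total_pages, max_workers=4):
    """Fetch all pages concurrently and flatten them into one list of records."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() runs fetch_page in worker threads but yields results in page order.
        pages = pool.map(fetch_page, range(total_pages))
        return [record for page in pages for record in page]


all_messages = fetch_all(TOTAL_PAGES)
```

Threads suit this case because the work is I/O-bound, but everything still
runs on one machine, which is part of why a distributed approach is of
interest.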

Can Apache Spark help with all this?

Thank you in advance.



