Hello, I work at an IoT company, and I have a use case for which I would like to know whether Apache Spark could be helpful. It's a very broad question, and sorry if it's long-winded.
We have HTTP GET APIs to get two kinds of information:

1) The Device Messages API returns data about device messages (in JSON).
2) The Devices API returns information about devices (in JSON) -- for example, device name, device owner, etc.

Each Device Message has a Device ID field, which points to the device that sent it. To make it clearer: we have devices, and each device can send many device messages. Our goal is to fetch device messages and send them to an ElasticSearch index.

The two major problems are:

1) We need the data to be denormalized. That is, we don't want one index for device messages and a separate index for device information -- we want each device message to have the corresponding device's information attached to it, because ElasticSearch works best with denormalized data. So we would like a solution that can join (as in an SQL join) the Device Message data with the Device data and apply some transformations to the result before sending it to ElasticSearch. We can potentially have millions of devices and device messages, so this solution needs to be scalable.
2) Both the Device Messages API and the Devices API are paged, and can potentially have thousands of pages. Making HTTP requests for thousands of pages can become inefficient, so it would be good to have a way to parallelize this process.

So, in short, we would like a solution that can help with:

1) Joining and transforming large amounts of data (from a paged API) before sending it to ElasticSearch.
2) Making the process of going through all the pages in the paged APIs more efficient.

Can Apache Spark help with all this?

Thank you in advance.
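P.S. To make the shape of the problem concrete, here is a minimal sketch in plain Python of the two steps described above: fetching pages in parallel and attaching device information to each message. Everything in it is made up for illustration. The field names (device_id, message_id, name, owner, payload), the in-memory page lists, and the helper functions are assumptions, not our real APIs, and no HTTP requests are made.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the two paged HTTP GET APIs.
# Each inner list plays the role of one page of JSON records.
DEVICE_PAGES = [
    [{"device_id": 1, "name": "thermostat", "owner": "alice"}],
    [{"device_id": 2, "name": "doorbell", "owner": "bob"}],
]
MESSAGE_PAGES = [
    [{"message_id": 10, "device_id": 1, "payload": "21.5C"}],
    [{"message_id": 11, "device_id": 2, "payload": "ring"},
     {"message_id": 12, "device_id": 1, "payload": "22.0C"}],
]


def fetch_page(pages, page_number):
    """Stand-in for one HTTP GET of a single page (no network access here)."""
    return pages[page_number]


def fetch_all(pages, max_workers=4):
    """Fetch every page concurrently and flatten the results into one list."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in page order, even if fetches overlap.
        results = pool.map(lambda n: fetch_page(pages, n), range(len(pages)))
        return [record for page in results for record in page]


def denormalize(messages, devices):
    """Attach the sending device's fields to each message (like an SQL inner join)."""
    devices_by_id = {d["device_id"]: d for d in devices}
    joined = []
    for message in messages:
        device = devices_by_id.get(message["device_id"])
        if device is None:
            continue  # inner-join semantics: skip messages from unknown devices
        joined.append({**message, "device": device})
    return joined


# Each resulting document is what we would index into ElasticSearch.
documents = denormalize(fetch_all(MESSAGE_PAGES), fetch_all(DEVICE_PAGES))
```

This works on one machine with everything held in memory, which is exactly what we don't expect to scale to millions of devices and messages. The question is whether Spark can take over both the parallel fetching and the join.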