Re: Processing huge amount of data from paged API

Jörn Franke Sun, 21 Jan 2018 13:26:30 -0800

Which device provides messages as thousands of HTTP pages? This is obviously 
inefficient, and running the requests in parallel will not help much. Furthermore, 
with paging you risk that messages get lost or that you get duplicate messages. I 
still do not get why applications nowadays download a lot of data through services 
that provide a paging mechanism: it has failed in the past, it fails today, and it 
will fail in the future.
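
To make the duplicate/loss risk concrete, here is a minimal sketch. It assumes offset-based paging over a newest-first list and a unique `id` field on every message. Both are assumptions for illustration, and `fetch_page` is a hypothetical stand-in for one HTTP GET, not the real API.

```python
def fetch_page(store, page, size):
    """Stand-in for one HTTP GET on a paged endpoint (offset-based paging)."""
    start = page * size
    return store[start:start + size]


def collect_unique(store, size, on_page_fetched=None):
    """Walk all pages, keeping only messages whose id has not been seen yet."""
    seen = set()
    result = []
    page = 0
    while True:
        batch = fetch_page(store, page, size)
        if not batch:
            break
        for msg in batch:
            if msg["id"] not in seen:
                seen.add(msg["id"])
                result.append(msg)
        if on_page_fetched is not None:
            on_page_fetched(store, page)  # simulate writes between requests
        page += 1
    return result


# Newest-first store. A new message arrives after the first page is read,
# which shifts every later offset by one, so message 3 shows up again on
# the second page.
store = [{"id": i} for i in (5, 4, 3, 2, 1)]


def new_message_arrives(store, page):
    if page == 0:
        store.insert(0, {"id": 6})


messages = collect_unique(store, size=3, on_page_fetched=new_message_arrives)
ids = [m["id"] for m in messages]
```

Tracking seen IDs removes the repeated message, but the new message 6 is never returned by this pass. A deletion between requests shifts the offsets the other way and skips an existing message, and no client-side de-duplication can repair that.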


Can’t the device push data onto a bus, e.g. Kafka? Maybe via STOMP or similar? If 
in doubt, the device could prepare a file with all the measurements and make the 
file available through HTTP (with resumable downloads, of course).
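
A resumable download of such a file could look roughly like the sketch below, using only the Python standard library. It assumes the server honours the HTTP `Range` header. The function names and any URL passed in are hypothetical.

```python
import os
import urllib.error
import urllib.request


def build_resume_request(url, offset):
    """Build a GET request asking the server for bytes from `offset` onward."""
    request = urllib.request.Request(url)
    if offset > 0:
        request.add_header("Range", f"bytes={offset}-")
    return request


def resume_download(url, dest, chunk_size=64 * 1024):
    """Download `url` to `dest`, continuing from a partial file if one exists."""
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    try:
        response = urllib.request.urlopen(build_resume_request(url, offset))
    except urllib.error.HTTPError as err:
        if err.code == 416:  # range starts at or past the end: nothing left to fetch
            return offset
        raise
    with response:
        # 206 means the server honoured the range. A plain 200 means it is
        # sending the whole file again, so the partial copy must be replaced.
        mode = "ab" if response.status == 206 else "wb"
        with open(dest, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
    return os.path.getsize(dest)
```

Calling `resume_download` again after an interrupted transfer continues from the bytes already on disk. This sketch does not check whether the file changed on the server in the meantime. A more careful client would also send an `If-Range` header with the ETag from the first response.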

> On 21. Jan 2018, at 21:33, anonymous <claudioepa...@gmail.com> wrote:
> 
> Hello,
> 
> I work at an IoT company, and I have a use case for which I would like to
> know if Apache Spark could be helpful. It's a very broad question, and sorry
> if it's long-winded.
> 
> We have HTTP GET APIs to get two kinds of information:
> 1) The Device Messages API returns data about device messages (in JSON).
> 2) The Devices API returns information about devices (in JSON) -- for
> example, device name, device owner, etc. Each Device Message has a Device ID
> field, which points to the device which sent it.
> To make it clearer, we have devices, and each device can send many device
> messages.
> Our goal is to fetch device messages and send them to an ElasticSearch index.
> 
> The two major problems are:
> 1) We need data to be denormalized. That is, we don't want to have one index
> for device messages, and a separate index for device information -- we want
> each device message to have the corresponding device's information attached
> to it. That is because ElasticSearch works best with denormalized data. So,
> we would like a solution that can join (as in an SQL join) the Device
> Message data with the Device data and apply some transformations to them
> before sending it to ElasticSearch. We can potentially have millions of
> devices and device messages, so this solution needs to be scalable.
> 2) Both the Device Messages API and the Devices API are paged, and can
> potentially have thousands of pages. We can potentially have millions of
> devices and device messages. Making HTTP requests for thousands of pages can
> become inefficient. So, it would be good to have a way to parallelize this
> process.
> 
> So, to be short, we would like a solution that can help with:
> 1) Joining and transforming large amounts of data (from a paged API) before
> sending it to ElasticSearch.
> 2) Making the process of sifting through all the pages in the paged APIs
> more efficient.
> 
> Can Apache Spark help with all this?
> 
> Thank you in advance.
> 
> 
> 
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
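
For reference, the denormalization described in point 1 of the quoted message amounts to a left join on the device ID. Below is a minimal in-memory sketch in plain Python. The field names (`device_id`, `name`, `owner`, `payload`) are assumptions, not the real API schema.

```python
def denormalize(messages, devices):
    """Attach the matching device record to every device message."""
    # Index the devices once so each message lookup is a dictionary access.
    devices_by_id = {d["device_id"]: d for d in devices}
    for msg in messages:
        doc = dict(msg)  # copy, so the input message is left untouched
        # None when no device matches, i.e. left-join semantics.
        doc["device"] = devices_by_id.get(msg["device_id"])
        yield doc


devices = [
    {"device_id": "d1", "name": "thermostat", "owner": "alice"},
    {"device_id": "d2", "name": "door-sensor", "owner": "bob"},
]
messages = [
    {"message_id": 1, "device_id": "d1", "payload": {"temp": 21.5}},
    {"message_id": 2, "device_id": "d2", "payload": {"open": True}},
    {"message_id": 3, "device_id": "d9", "payload": {}},  # unknown device
]

docs = list(denormalize(messages, devices))
```

In Spark the same operation would be a `DataFrame.join` of the two datasets on the device ID column. The dictionary index here only works while the device data fits in one machine's memory.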
