Re: Processing huge amount of data from paged API

2018-01-21 Thread anonymous
The devices and device messages are retrieved using the APIs provided by
company X (not the company's real name), which owns the IoT network.
There is the option of setting HTTP POST callbacks for device messages, but
we want to be able to run analytics on messages from ALL the devices on the
network. Since the devices are owned by clients and we can't force every
client to set callbacks on their devices, our only remaining option is to
use this GET API.
In fact, company X provides other APIs as well, and many of them are paged
too, so this is a broader issue that should be addressed soon.
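
For illustration, one idea we are considering is to spread the page fetches
across Spark executors by distributing the page numbers. A rough sketch in
Scala, assuming the API exposes numbered pages and each page returns a JSON
array of message objects; the endpoint URL, query parameter and page count
are placeholders, not company X's real API:

    import org.apache.spark.sql.SparkSession
    import scala.io.Source

    val spark = SparkSession.builder().appName("paged-fetch").getOrCreate()
    import spark.implicits._

    // Placeholder endpoint and page count -- substitute the real API values.
    val baseUrl = "https://api.example.com/v1/device-messages"
    val totalPages = 5000

    // Distribute the page numbers so each executor fetches its own slice.
    val pagesJson = spark.sparkContext
      .parallelize(1 to totalPages, numSlices = 200)
      .map { page =>
        // One HTTP GET per page; a real job would add auth, retries and
        // rate limiting.
        Source.fromURL(s"$baseUrl?page=$page").mkString
      }

    // Parse the fetched pages into a DataFrame, letting Spark infer a schema.
    val messages = spark.read.json(pagesJson.toDS())
    messages.printSchema()

This only parallelizes the downloads; it does not remove the lost-or-duplicated
message risk that comes with paging.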






Re: Processing huge amount of data from paged API

2018-01-21 Thread Jörn Franke
Which device provides messages as thousands of HTTP pages? This is obviously
inefficient, and running the requests in parallel will not help much.
Furthermore, with paging you risk that messages get lost or that you receive
duplicates. I still don't understand why applications nowadays download large
amounts of data through services that provide a paging mechanism - it has
failed in the past, it fails today, and it will fail in the future.

Can't the device push data onto a bus, e.g. Kafka? Maybe via STOMP or
similar? If in doubt, the device could prepare a file with all the
measurements and make the file available through HTTP (with resumable
downloads, of course).
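
For illustration, the consuming side of such a push could be a few lines of
Spark Structured Streaming. A minimal sketch, assuming the messages land in a
Kafka topic; the broker address, topic name and output paths are placeholders,
and it needs the spark-sql-kafka connector on the classpath:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-ingest").getOrCreate()

    // Placeholder broker address and topic name.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "device-messages")
      .load()

    // Kafka delivers the payload as bytes; cast it back to a JSON string.
    val messages = stream.selectExpr("CAST(value AS STRING) AS json")

    // Persist the raw stream; a real job would parse and enrich it first.
    messages.writeStream
      .format("parquet")
      .option("path", "/data/device-messages")
      .option("checkpointLocation", "/data/checkpoints/device-messages")
      .start()
      .awaitTermination()

With checkpointed Kafka offsets there are no pages to lose or re-fetch, which
avoids exactly the duplication problem described above.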

> On 21. Jan 2018, at 21:33, anonymous wrote:
> [...]




Processing huge amount of data from paged API

2018-01-21 Thread anonymous
Hello,

I'm at an IoT company, and I have a use case for which I would like to know
whether Apache Spark could be helpful. It's a very broad question, so sorry
if this is long-winded.

We have HTTP GET APIs to get two kinds of information:
1) The Device Messages API returns data about device messages (in JSON).
2) The Devices API returns information about devices (in JSON) -- for
example, device name, device owner, etc. Each Device Message has a Device ID
field, which points to the device that sent it.
To make it clearer: we have devices, and each device can send many device
messages.
Our goal is to retrieve the device messages and send them to an ElasticSearch
index.

The two major problems are:
1) We need the data to be denormalized. That is, we don't want one index for
device messages and a separate index for device information -- we want each
device message to have the corresponding device's information attached to it,
because ElasticSearch works best with denormalized data. So we would like a
solution that can join (as in an SQL join) the Device Message data with the
Device data and apply some transformations before sending the result to
ElasticSearch (see the sketch after this list). We can potentially have
millions of devices and device messages, so this solution needs to be
scalable.
2) Both the Device Messages API and the Devices API are paged and can
potentially run to thousands of pages. Making HTTP requests for thousands of
pages sequentially is inefficient, so it would be good to have a way to
parallelize this process.
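
To make problem 1 concrete, here is roughly the kind of job we have in mind,
if Spark turns out to be the right tool. The join maps onto Spark SQL, and,
as far as I understand, the elasticsearch-hadoop connector (the
elasticsearch-spark artifact) can write a DataFrame straight to an index.
This is only a rough sketch: the column names (deviceId, id), input paths,
index name and ES host are all placeholders, not our real schema.

    import org.apache.spark.sql.SparkSession
    import org.elasticsearch.spark.sql._ // from the elasticsearch-spark connector

    val spark = SparkSession.builder()
      .appName("denormalize-messages")
      .config("es.nodes", "elasticsearch-host") // placeholder ES address
      .getOrCreate()

    // Placeholder inputs, e.g. pages already fetched and dumped as JSON.
    val messages = spark.read.json("/data/device-messages")
    val devices  = spark.read.json("/data/devices")

    // Attach each message's device info via an SQL-style join on the device ID.
    val denormalized = messages
      .join(devices, messages("deviceId") === devices("id"))
      .drop(devices("id"))

    // One Elasticsearch document per enriched message ("index/type" resource).
    denormalized.saveToEs("device-messages/doc")

The join itself scales out with the cluster, and Spark broadcasts the smaller
side automatically when it fits in memory, which is what makes us hope this
covers the millions-of-rows concern.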

So, in short, we would like a solution that can help with:
1) Joining and transforming large amounts of data (from a paged API) before
sending it to ElasticSearch.
2) Making the process of sifting through all the pages in the paged APIs
more efficient.

Can Apache Spark help with all this?

Thank you in advance.


