Thank you so much for your help!
 
The "profile" is a kind of step function, so I would store the values in an array. My key would be the program ID, and I would also store the start time and the array with the values for each minute in Redis. Is it not possible to remove the key from Redis as soon as I receive an end event? I thought it would be easy to remove a specific key from Redis.
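Removing a key on an end event is indeed just a single DEL in Redis. A minimal sketch of that bookkeeping, using a plain Python dict to stand in for Redis (the event shape and the `handle_event` helper are illustrative, not from the thread):

```python
# In-memory stand-in for Redis: one entry per active program.
# In Redis this would be e.g. SET/HSET on start and DEL on end.
active = {}

def handle_event(event):
    """Store a program on 'start', drop it on 'end'."""
    if event["command"] == "start":
        active[event["program"]] = {
            "start": event["timestamp"],
            "profile": event.get("profile", []),  # per-minute step values
        }
    elif event["command"] == "end":
        active.pop(event["program"], None)  # corresponds to DEL in Redis

handle_event({"program": "X", "timestamp": 0, "command": "start",
              "profile": [1, 2, 3]})
handle_event({"program": "X", "timestamp": 180, "command": "end"})
# program X is gone after its end event
```

The same pattern works with a Redis hash or one key per program ID; only a list layout makes removal awkward, as discussed below.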
 
The problem is that I have to make a prediction in real time afterwards; that was one of the reasons for choosing Storm. I would like to include a predictive model via PMML to predict the values for the next 24 hours in real time. So after calculating the sum I would return it to Storm and make the prediction in another bolt that contains the PMML model. Is that right?
 
So, summing up, would it be better to use other tools? I thought the combination of Kafka, Storm, and Redis would be a good fit, as my use case should be real-time and predictive. That is why I did not consider batch-processing tools such as Hadoop. Which tools would you recommend for my use case?
 
Thank you and regards,
Daniela
 
 
Sent: Tuesday, 31 May 2016 at 01:11
From: "Yuri Kostin" <kost...@gmail.com>
To: user@storm.apache.org
Subject: Re: Re: Re: Pull from Redis
In theory, if you store the profile together with the program start time, you can calculate the value you need in whatever database you are storing it in; in Redis you can do it with a Lua script, for example. My guess is that there might be a point where iteration and calculation take longer than a minute and are essentially out of date as soon as they finish. Eliminating the profile lookup during calculation might help with the processing time.

If minute intervals are predictable, e.g. if minute 1 is 2 and minute 2 is 4, then minute 60 is 120, you only need to store a "formula" or pattern for this calculation rather than an array of all possible values. If multiple programs share a formula, you could preload the formulas into a lookup and use a "type" field to find the right one. It's a tricky one.

If you use a Redis list, it is more difficult to remove an item after you receive the end time. Essentially you would need to keep track of both active and recently ended programs, and during calculation either remove the program from the list or leave it in. This could be done by creating an empty Redis key such as "program_id_ended"; while iterating, you check whether that key exists, and if so remove both the value and the ended-flag key and keep going.

Alternatively, you can create a structure in Storm (a hash map, etc.), populate it with your programs and profiles, and calculate the sum on a system tick tuple. I don't know what kind of performance and memory requirements you will get if you store millions of items, but you should be able to scale it across many servers. The durability of this approach is also not the same as Redis's: if the topology goes down, this store has to be rebuilt from somewhere. This is a pretty simple map/reduce process; I am just not sure Redis is the best tool for the job. Maybe multiple Redis servers could share the load of key iteration, if that becomes the bottleneck.
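The "formula instead of an array" idea can be sketched in a few lines of pure Python. The linear profile here (minute m has value 2*m) and the program data are illustrative assumptions matching the example in the paragraph above:

```python
# Hypothetical profile formulas keyed by program "type":
# if minute 1 is 2 and minute 2 is 4, then minute m is 2*m.
FORMULAS = {
    "linear2": lambda minute: 2 * minute,
}

# Active programs: program ID -> (type, start minute).
active = {"A": ("linear2", 0), "B": ("linear2", 57)}

def total_at(now_minute):
    """Sum each running program's current value from its type's formula."""
    total = 0
    for prog_type, start in active.values():
        elapsed = now_minute - start
        total += FORMULAS[prog_type](elapsed)
    return total

print(total_at(60))  # A is at minute 60 -> 120, B at minute 3 -> 6, sum 126
```

The point is that only the (type, start) pair needs to be stored per program, not 100-150 array entries, which shrinks both storage and the per-minute lookup work.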
I would try Redis with either a million items in a list or a million keys, then use Lua to do the calculation and return the sum, storing the profiles in the same JSON payload. This would be a benchmark for the "ideal" situation using Redis. It should be fairly easy to populate a test Redis DB with some data using a script in any language.
 
On May 30, 2016, at 4:24 PM, Daniela S <daniela_4...@gmx.at> wrote:
 
No problem, I am so glad that you are helping me! Thank you!
 
No, unfortunately that is not possible. I only receive events containing the program, the timestamp, and the command "start" or "end". I could join the "profile" with each event, but I am not sure that makes sense, as I still have to repeat my calculation every minute and store my active programs somewhere; otherwise I do not know which programs have already ended.
This differs per program, but I would say around 100 to 150 minute values per program. There could be millions of programs running at the same time.
 
Regards,
Daniela
 
Sent: Monday, 30 May 2016 at 23:17
From: "Yuri Kostin" <kost...@gmail.com>
To: user@storm.apache.org
Subject: Re: Re: Re: Re: Pull from Redis
Is it possible to store these values with the original JSON payload? How many minute values are there? How many programs could be running at the same time?
I am sorry about all the questions, there are just so many ways this can be approached and every detail could make a difference.
 
On May 30, 2016, at 4:14 PM, Daniela S <daniela_4...@gmx.at> wrote:
 
No, unfortunately not. Each program has its own "profile" with different values for each minute.
 
Sent: Monday, 30 May 2016 at 23:04
From: "Yuri Kostine" <kost...@gmail.com>
To: user@storm.apache.org
Subject: Re: Re: Re: Re: Re: Pull from Redis
 
Makes sense. Do all programs have the same value at minute 3?

On May 30, 2016, at 3:55 PM, Daniela S <daniela_4...@gmx.at> wrote:
 
I will try to explain in a bit more detail:
 
I receive, for example, a start event for program X. When program X is finished I receive an end event for program X. As long as I have not received an end event for a program, I assume that it is still running and it should be stored in Redis.
Let's assume program X has been started and I have not received an end event yet. I have to pull it from Redis and calculate how far along the program is at the moment (current time - start time). With this value, let's assume it is minute 3, I look up which value corresponds to minute 3, and this value is the one I need for my sum.
I have to do this for every started program, and I have to repeat the summing every minute, because every program changes its value each minute as long as it has not ended.
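That per-minute calculation can be sketched in plain Python (the profile values and timestamps are made up for illustration; in the real setup the `profiles` and `starts` data would live in Redis):

```python
# Per-program "profile": the value for each minute since start (step function).
# Program X at minute 3 -> profiles["X"][3]. Data here is illustrative.
profiles = {"X": [0, 5, 7, 9, 12]}
starts = {"X": 0}  # start timestamps in seconds

def current_sum(now):
    """Look up each running program's value at its elapsed minute and sum."""
    total = 0
    for prog, start in starts.items():
        minute = int((now - start) // 60)   # current time - start time
        profile = profiles[prog]
        if minute < len(profile):           # still within the profile
            total += profile[minute]
    return total

print(current_sum(180))  # X is at minute 3 -> value 9
```

Repeating this every minute over millions of programs is exactly the iteration cost discussed in the replies above.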
 
Thank you and regards,
Daniela
 
Sent: Monday, 30 May 2016 at 22:34
From: "Yuri Kostine" <kost...@gmail.com>
To: user@storm.apache.org
Subject: Re: Re: Re: Re: Pull from Redis
 
Is the sum the amount of time all current programs have been running? How does Storm/Redis know when a program is done and needs to be removed? For example, you get a JSON payload with a start time and no end time. You push that into a Redis key or list. One minute lapses (no other events have been written); you look at that JSON and calculate the elapsed time in seconds, i.e. time now minus start time. Let's say it's 120; what do you then do with 120? And if there are 10 events, each returning 120, is the result a single 1200 calculation, or do you have to calculate each event by itself and then sum the results because each event gets its own unique multiplier?

On May 30, 2016, at 2:35 PM, Daniela S <daniela_4...@gmx.at> wrote:
 
Thank you for your support! I will try to explain what I would like to do:
 
I am receiving JSON strings from Kafka. These JSON strings contain start and end events for programs. I would like to use Redis as a cache to store all programs that have started but not yet ended. As soon as a program has ended, it should be deleted from Redis. I would like to build a sum over all programs stored in Redis, but I need another value to build it. To get this value I have to calculate the difference between the current time and the timestamp of each event stored in Redis. With this calculated value I look up the value I need for the sum. This must be done for each stored entry, and it should be repeated every time a value is added to or removed from Redis, or otherwise every minute.
 
How should such problems be solved within Storm? I thought about a kind of cache like Redis.
 
Thank you in advance.
 
Regards,
Daniela
 
Sent: Monday, 30 May 2016 at 21:16
From: "Yuri Kostine" <kost...@gmail.com>
To: user@storm.apache.org
Subject: Re: Re: Re: Pull from Redis
 
It depends on the definition of slow and the data stored, of course; my guess is that a few million keys might take a minute? Pure guess. Redis is a key-value store: you give it a key and you perform an operation on its value. Iterating over all keys is the slowest operation in Redis, and I think it also blocks all other operations while it is executing. I know this is a Storm group and not a Redis group; I am not sure there is a Storm solution if Redis is your partial data storage. It is not a relational database, so it is not great at joins, aggregations, etc. Just my 2c.

Time-series aggregations in Redis are done with one key per interval. For example, an event at 1:30 PM on 2016-06-01 would increment a counter in the 2016 key, the 2016-06 key, the 2016-06-01 key, and so on, down to your smallest interval. Then to pull the count for a day you only access one key, 2016-06-01. This approach is fast because all operations are key-value based, touching only one key at a time.

Is there no way to pull the data you need before you store that key into Redis? You could also use Redis as your queue and process it once a minute with a topology, then create a new time-based queue key and keep going. You would store your data a bit differently, though: instead of many keys, you would have one key with an array of values. You keep pushing into it based on a timestamp; when the interval lapses, you process it with Storm, popping those values out one at a time, looking up the data you need, keeping an aggregate, and continuing until the queue is empty.
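The one-key-per-interval idea can be sketched with a dict standing in for Redis counters (in Redis each increment would be an INCRBY on the corresponding key; the key formats here are one plausible choice):

```python
from collections import defaultdict
from datetime import datetime

# Stand-in for Redis counters; in Redis each of these keys would be INCRBY'd.
counters = defaultdict(int)

def record(ts, amount=1):
    """Increment one counter per time granularity for the event timestamp."""
    for fmt in ("%Y", "%Y-%m", "%Y-%m-%d", "%Y-%m-%d %H:%M"):
        counters[ts.strftime(fmt)] += amount

record(datetime(2016, 6, 1, 13, 30))
record(datetime(2016, 6, 1, 14, 0))
print(counters["2016-06-01"])  # whole-day count from a single key lookup -> 2
```

Reading any aggregate is then a single key access, which is why this layout stays fast regardless of how many events have been recorded.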

On May 30, 2016, at 1:17 PM, Daniela S <daniela_4...@gmx.at> wrote:
 
I have to pull the entries and add a specific value to each entry. This value is stored in another database, and therefore I would like to do the join, based on some conditions, in Storm. I need this value to build the sum, as the entries themselves do not contain any information for it.
 
What would count as very few keys?
 
Thank you and regards,
Daniela
 
Sent: Monday, 30 May 2016 at 20:11
From: "Yuri Kostine" <kost...@gmail.com>
To: user@storm.apache.org
Subject: Re: Pull from Redis
 
Do you pull entries only to sum them up? Why not keep a running total in Redis in a timestamped key, one per minute? Generally speaking, Redis is not great for pulling all keys unless there are very few of them.

On May 30, 2016, at 12:49 PM, Daniela S <daniela_4...@gmx.at> wrote:
 
Hi
 
I have a topology that stores entries in Redis. Now I would like to pull all entries from Redis every minute or as soon as a value has changed. How can I do that? Can I add another bolt to my topology for this task or do I have to use a spout or even a new topology? I would like to build a sum over all entries every minute. Do you have any advice for that?
 
Thank you in advance.
 
Regards,
Daniela
