Hi Flavio,

From what I understand, for the first part you are correct. You can use Flink's 
internal state to keep your enriched data.
In fact, if you are also querying an external system to enrich your data, it is 
worth looking at the AsyncIO feature:

https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/asyncio.html
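
To give a rough idea of why this helps, here is a minimal plain-Java sketch of 
the underlying pattern: issue all lookups first, then collect the results, so 
that the waiting times overlap. It deliberately does not use Flink's own Async 
I/O classes. AsyncEnricher, enrichAll and the lookup function are hypothetical 
names of mine that stand in for a non-blocking client of your external system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical helper, not part of Flink: enriches records by issuing all
// external lookups concurrently instead of one blocking call at a time.
class AsyncEnricher {

    // 'lookup' is an assumed non-blocking call to the external system.
    static List<String> enrichAll(List<String> records,
                                  Function<String, CompletableFuture<String>> lookup) {
        // Fire every request first so that the waits overlap.
        List<CompletableFuture<String>> pending = new ArrayList<>();
        for (String record : records) {
            pending.add(lookup.apply(record)
                    .thenApply(extra -> record + "|" + extra));
        }
        // Collect the results in input order.
        List<String> enriched = new ArrayList<>();
        for (CompletableFuture<String> future : pending) {
            enriched.add(future.join());
        }
        return enriched;
    }
}
```

In an actual job the Async I/O operator described at the link above would take 
care of ordering, timeouts and back-pressure, which this sketch ignores.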
 

Now for the second part: currently in Flink you cannot iterate over all 
registered keys for which you have state. A pointer 
that may be useful to look at is the queryable state feature:

https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/queryable_state.html
 

This is still an experimental feature, but let us know your opinion if you use 
it.

Finally, an alternative would be to keep state in Flink, and periodically flush 
it to an external storage system, which you can
query at will.
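
As a rough illustration of that last option, here is a plain-Java model of 
per-key grouped values that are periodically copied to an external store. It is 
only a sketch under my own assumptions: GroupedState, add and flushTo are 
hypothetical names, the Map stands in for the external storage system, and in a 
real job the per-key lists would live in Flink keyed state (e.g. ListState) 
rather than in a HashMap.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java model, not Flink API: the per-key lists play the
// role of keyed state, and 'externalStore' plays the role of the external system.
class GroupedState {
    private final Map<String, List<String>> valuesByKey = new HashMap<>();

    // Append a value to the group of its key.
    synchronized void add(String key, String value) {
        valuesByKey.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Copy the current groups into the external store. Copies are defensive,
    // so later updates to the state do not change what was already flushed.
    synchronized void flushTo(Map<String, List<String>> externalStore) {
        for (Map.Entry<String, List<String>> entry : valuesByKey.entrySet()) {
            externalStore.put(entry.getKey(), new ArrayList<>(entry.getValue()));
        }
    }
}
```

The point of the copy is that readers of the external store see a consistent 
snapshot as of the last flush, while the stream keeps updating the live state.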

Thanks,
Kostas


> On May 16, 2017, at 4:38 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
> 
> Hi to all,
> we're still playing with the Flink streaming part in order to see whether it can 
> improve our current batch pipeline.
> At the moment, we have a job that translates incoming data (as Row) into 
> Tuple4, groups them together by the first field and persists the result to 
> disk (using a thrift object). When we need to add tuples to those grouped 
> objects we need to read the persisted data again, flatten it back to Tuple4, 
> union it with the new tuples, re-group by key and finally persist.
> 
> This is very expensive to do with batch computation, while it should be pretty 
> straightforward to do with streaming (from what I understood): I just need to 
> use ListState. Right?
> Then, let's say I need to scan all the data of the stateful computation (keys 
> and values) in order to do some other computation. I'd like to know:
> 1. How to do that? I.e. create a DataSet/DataSource<Key,Value> from the stateful 
>    data in the stream.
> 2. Is there any problem with accessing the stateful data without stopping incoming 
>    data (and thus possible updates to the state)?
> Thanks in advance for the support,
> Flavio
> 
