Hi Huyen,

That is not my problem statement. If it were just ingesting from A to B, I am
sure there are enough tutorials for me to get it done. I also feel like the
more I elaborate, the more confusing it gets, and I am not sure why.

I want to join two streams and see/query the results of that join in
real-time, but I also have some constraints.

I come from the Spark world. In Spark, if I do an inner join of two streams
A and B, I can see results only when there are matching rows between A and
B (which makes sense by the definition of an inner join). But if I do an
outer join of two streams A and B, I need to specify a time constraint, and
only when that time has fully elapsed can I see the rows that did not match.
This means that if my time constraint is one hour, I cannot query the
intermediate results until the hour has passed, and this is not the behavior
I want. I want to be able to run all sorts of SQL queries on the
intermediate results.
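To make the complaint concrete, here is a minimal sketch in plain Python (not the Spark API; the names, the state layout, and the one-hour window are all illustrative assumptions) of why a stream-stream outer join holds back unmatched rows until the time constraint elapses:

```python
# Plain-Python sketch of watermark-gated outer-join emission.
# Matched rows are visible immediately; unmatched rows from A become
# visible only after the watermark passes event_time + JOIN_WINDOW.

JOIN_WINDOW = 3600  # hypothetical one-hour time constraint, in seconds

def outer_join_step(state_a, state_b, watermark):
    """One pass over buffered state: emit matches now, emit outer
    (unmatched) rows only once the join window has fully elapsed."""
    results = []
    for key, (ts_a, row_a) in list(state_a.items()):
        if key in state_b:
            _, row_b = state_b[key]
            results.append((row_a, row_b))   # inner match: visible now
            del state_a[key]
            del state_b[key]
        elif watermark > ts_a + JOIN_WINDOW:
            results.append((row_a, None))    # outer row: visible only
            del state_a[key]                 # after the hour elapses
    return results

state_a = {"k1": (0, "a1"), "k2": (0, "a2")}
state_b = {"k1": (10, "b1")}

print(outer_join_step(state_a, state_b, watermark=100))   # [('a1', 'b1')]
print(outer_join_step(state_a, state_b, watermark=4000))  # [('a2', None)]
```

The second call illustrates the problem: "a2" is sitting in state the whole time, but nothing downstream can see it until the watermark clears the window.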

I like Flink's idea of externalizing the state, since that way I don't have
to worry about memory, but I am also trying to avoid writing a separate
microservice that polls and displays the intermediate results of the join
in real-time. Instead, I am trying to see if there is a way to treat those
constantly evolving intermediate results as a streaming source, and maybe
do some more transformations and push them out to another sink.
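The behavior I'm after could be sketched like this (again plain Python, not Flink's API; the keyed in-memory state and the upsert-tuple output shape are assumptions for illustration): every arriving row immediately emits the current, possibly partial, join result for its key, so the evolving intermediate state is itself a stream that downstream transformations or a sink can consume.

```python
# Plain-Python sketch of an "emit-on-update" join: each event upserts
# the join result for its key right away, partial or not, instead of
# waiting for a match or for a time window to close.

def make_join():
    state_a, state_b = {}, {}

    def on_event(source, key, row):
        side = state_a if source == "A" else state_b
        side[key] = row
        # Emit an upsert for this key: partial result if no match yet,
        # full result once both sides have arrived.
        return (key, state_a.get(key), state_b.get(key))

    return on_event

join = make_join()
print(join("A", "k1", "a1"))  # ('k1', 'a1', None)  <- queryable immediately
print(join("B", "k1", "b1"))  # ('k1', 'a1', 'b1')  <- updated on match
```

With this shape, the unmatched row from A is visible the moment it arrives, and the same key is simply upserted again when the match from B turns up.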

Hope that makes sense.

Thanks,
Kant



On Thu, Oct 31, 2019 at 2:43 AM Huyen Levan <lvhu...@gmail.com> wrote:

> Hi Kant,
>
> So your problem statement is "ingest 2 streams into a data warehouse". The
> main component of the solution, from my view, is that SQL server. You can
> have a sink function to insert records in your two streams into two
> different tables (A and B), or upsert into one single table C. That upsert
> action itself serves as a join function, there's no need to join in Flink
> at all.
>
> There are many tools out there that can be used for that ingestion. Flink,
> of course, can be used for that purpose. But for me, it's overkill.
>
> Regards,
> Averell
>
> On Thu, 31 Oct. 2019, 8:19 pm kant kodali, <kanth...@gmail.com> wrote:
>
>> Hi Averell,
>>
>> yes, I want to run ad-hoc SQL queries on the joined data as well as data
>> that may join in the future. For example, if you take datasets A and B in
>> streaming mode, a row in A can join with a row in B at some time in the
>> future; meanwhile, if I query the intermediate state using SQL, I want
>> the rows in A that have not yet joined with B to also be available to
>> query. So not just the joined results alone, but also the data that might
>> join in the future, since it's all streaming.
>>
>> Thanks!
>>
>
