Aljoscha,

Your thoughts on this?
Srikanth

On Mon, Apr 25, 2016 at 8:08 PM, Srikanth <srikanth...@gmail.com> wrote:

> Aljoscha,
>
> Looks like a potential solution. It feels a bit hacky, though.
>
> I didn't quite understand why a list-backed store is used for the static
> input buffer. A join (inner) should emit only one record if there is a
> key match.
>
> Is it a property of the system to emit a Long.MAX_VALUE watermark when a
> finite stream source ends?
> If so, can I do something like this to read the static file in parallel?
> val meta = env.readTextFile("S3:///path/to/file").map(...).keyBy(...)
>
> Shouldn't we also override checkpoint handling in the custom operator?
> If so, should the checkpoint wait/fail during the initial read phase?
>
> Lohith,
> Adding a component like Cassandra just for this feels like overkill. But
> if I can't find a suitable way to do it, I might use it (or probably
> Redis).
>
> Srikanth
>
> On Fri, Apr 22, 2016 at 12:20 PM, Lohith Samaga M
> <lohith.sam...@mphasis.com> wrote:
>
>> Hi,
>> Cassandra could be used as a distributed cache.
>>
>> Lohith.
>>
>> ---- Aljoscha Krettek wrote ----
>>
>> Hi Srikanth,
>> that's an interesting use case. It's not possible to do something like
>> this out of the box, but I'm actually working on an API for such cases.
>>
>> In the meantime, I put together a short example that shows how something
>> like this can be programmed using the API that is currently available.
>> It requires writing a custom operator but is still somewhat succinct:
>> https://gist.github.com/aljoscha/c657b98b4017282693a67f1238c88906
>>
>> Please let me know if you have any questions.
>>
>> Cheers,
>> Aljoscha
>>
>> On Thu, 21 Apr 2016 at 03:06 Srikanth <srikanth...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I have a fairly typical streaming use case but can't figure out how
>>> best to implement it in Flink.
>>> I want to join records read from a Kafka stream with one (or more)
>>> dimension tables that are stored as flat files.
>>>
>>> As per this JIRA <https://issues.apache.org/jira/browse/FLINK-2320>,
>>> it's not possible to join a DataStream with a DataSet.
>>> These tables are too big to collect() and join.
>>>
>>> It would be good to read these files during startup, do a
>>> partitionByHash, and keep them cached.
>>> On the DataStream, perhaps do a keyBy and join.
>>> Is something like this possible?
>>>
>>> Srikanth
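For reference, here is a minimal sketch of the join the thread is circling around, built on the public DataStream API (connect plus a keyed RichCoFlatMapFunction) rather than the custom operator in Aljoscha's gist. Everything in it is illustrative: the Meta/Event case classes, the CSV layout, and the EnrichFunction name are assumptions, the S3 path is the placeholder from the thread, and the keyed ValueState/ListState calls assume Flink 1.2 or later. Unlike the gist, which buffers the static input, this variant buffers early stream records per key until their dimension record arrives.

import org.apache.flink.api.common.state.{ListState, ListStateDescriptor, ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
import scala.collection.JavaConverters._

case class Meta(key: String, label: String)      // illustrative dimension record
case class Event(key: String, payload: String)   // illustrative stream record

object StreamTableJoinSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Read the flat file as a finite stream; each parallel instance reads a split.
    // "S3:///path/to/file" is the placeholder path from the thread.
    val meta: DataStream[Meta] = env
      .readTextFile("S3:///path/to/file")
      .map { line =>
        val f = line.split(',')
        Meta(f(0), f(1))                         // assumed CSV layout: key,label
      }

    // Stand-in for the Kafka source, e.g. env.addSource(new FlinkKafkaConsumer(...)).
    val events: DataStream[Event] = env.fromElements(
      Event("k1", "p1"), Event("k2", "p2"))

    val joined: DataStream[(Event, Meta)] = events
      .connect(meta)
      .keyBy(_.key, _.key)   // co-partition both inputs on the join key
      .flatMap(new EnrichFunction)

    joined.print()
    env.execute("stream/static-table join sketch")
  }
}

// Inner join on the key: the dimension record is held in keyed ValueState;
// events that arrive before their dimension record are parked in ListState
// and flushed once it shows up. If the file has duplicate keys, the last
// record wins (illustrative simplification).
class EnrichFunction extends RichCoFlatMapFunction[Event, Meta, (Event, Meta)] {
  private var metaState: ValueState[Meta] = _
  private var pending: ListState[Event] = _

  override def open(parameters: Configuration): Unit = {
    metaState = getRuntimeContext.getState(
      new ValueStateDescriptor[Meta]("meta", classOf[Meta]))
    pending = getRuntimeContext.getListState(
      new ListStateDescriptor[Event]("pending", classOf[Event]))
  }

  override def flatMap1(e: Event, out: Collector[(Event, Meta)]): Unit = {
    val m = metaState.value()
    if (m != null) out.collect((e, m)) else pending.add(e)
  }

  override def flatMap2(m: Meta, out: Collector[(Event, Meta)]): Unit = {
    metaState.update(m)
    val buffered = pending.get()                 // may be null/empty before any event
    if (buffered != null) buffered.asScala.foreach(e => out.collect((e, m)))
    pending.clear()
  }
}

Because both buffers live in managed keyed state, they are snapshotted with every checkpoint automatically, which is one answer to the checkpoint question in the thread: nothing extra needs to be overridden as long as the buffering uses managed state.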
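On the watermark question: yes, when a finite source such as readTextFile reaches its end, Flink emits a final watermark with timestamp Long.MAX_VALUE on that stream, which is what the gist relies on to detect that the static side is complete. One way to observe this from user code, assuming event time and Flink 1.3+ (KeyedProcessFunction), is to register an event-time timer at Long.MaxValue, which can only fire once that final watermark arrives. The class below is an illustrative sketch, not code from the gist.

import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// Registers an event-time timer at Long.MaxValue on every element; the timer
// can only fire when the final Long.MAX_VALUE watermark of a finished finite
// source arrives, so onTimer doubles as an end-of-input callback per key.
class EndOfStaticInput[K, T] extends KeyedProcessFunction[K, T, T] {
  override def processElement(value: T,
                              ctx: KeyedProcessFunction[K, T, T]#Context,
                              out: Collector[T]): Unit = {
    ctx.timerService().registerEventTimeTimer(Long.MaxValue)
    out.collect(value)
  }

  override def onTimer(timestamp: Long,
                       ctx: KeyedProcessFunction[K, T, T]#OnTimerContext,
                       out: Collector[T]): Unit = {
    // Reached only after the source emitted its final Long.MAX_VALUE
    // watermark, i.e. the static file has been read completely.
  }
}

Wiring it in would look like meta.keyBy(_.key).process(new EndOfStaticInput[String, Meta]), under the same assumed types as the sketch above.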