RE: Join Stream with big ref table

2015-11-13 Thread LINZ, Arnaud
Hello,

I’ve worked around my problem by not using the HiveServer2 JDBC driver to read 
the ref table. Apparently, despite all the good options passed to the Statement 
object, it handles RAM poorly: converting the table to text format and reading 
it directly from HDFS works without any problem, with plenty of free memory left.

Greetings,
Arnaud

De : LINZ, Arnaud
Envoyé : jeudi 12 novembre 2015 17:48
À : 'user@flink.apache.org' 
Objet : Join Stream with big ref table

Hello,

I have to enrich a stream with a big reference table (11,000,000 rows). I 
cannot use “join” because I cannot window the stream; so in the “open()” 
function of each mapper I read the content of the table and put it in a 
HashMap (stored on the heap).

11M rows is quite big, but it should take less than 100 MB in RAM, so it’s 
supposed to be easy. However, I systematically run into a Java Out Of Memory 
error, even with huge 64 GB containers (5 slots / container).
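The pattern described here (load the reference table once per task in open(), then probe it per record in map()) is essentially a hash-join enrichment. A minimal Flink-free sketch of that logic; the class mirrors Flink's RichFunction lifecycle, but the names and the key/value layout are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the "load ref table in open(), probe it in map()" pattern.
// In Flink this logic would live in a RichMapFunction; here it is plain
// Java so the enrichment itself can be shown in isolation.
public class EnrichmentMapper {
    private Map<String, String> refTable;

    // Called once per parallel task before any record is processed,
    // mirroring RichFunction.open(). The row source stands in for the
    // JDBC/HDFS read described in the mail.
    public void open(Iterable<String[]> refRows) {
        refTable = new HashMap<>();
        for (String[] row : refRows) {
            refTable.put(row[0], row[1]); // key -> reference value
        }
    }

    // Called per stream record: enrich it with the looked-up value.
    public String map(String key) {
        String ref = refTable.getOrDefault(key, "UNKNOWN");
        return key + "," + ref;
    }
}
```

Note that each parallel task holds its own copy of the map, so the heap cost is the per-entry overhead times the row count times the number of slots per JVM.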

Path, ID:             akka.tcp://flink@172.21.125.28:43653/user/taskmanager
                      (4B4D0A725451E933C39E891AAE80B53B)
Data Port:            41982
Last Heartbeat:       2015-11-12, 17:46:14
All Slots:            5
Free Slots:           5
CPU Cores:            32
Physical Memory:      126.0 GB
Free Memory:          46.0 GB
Flink Managed Memory: 31.5 GB

I don’t clearly understand why this happens and how to fix it. Any clue?






L'intégrité de ce message n'étant pas assurée sur internet, la société 
expéditrice ne peut être tenue responsable de son contenu ni de ses pièces 
jointes. Toute utilisation ou diffusion non autorisée est interdite. Si vous 
n'êtes pas destinataire de ce message, merci de le détruire et d'avertir 
l'expéditeur.

The integrity of this message cannot be guaranteed on the Internet. The company 
that sent this message cannot therefore be held liable for its content nor 
attachments. Any unauthorized use or dissemination is prohibited. If you are 
not the intended recipient of this message, then please delete it and notify 
the sender.


Multilang Support on Flink

2015-11-13 Thread Welly Tambunan
Hi All,

I want to ask if there's multilang support in Flink (like in Storm, and
pipeTo in Spark)?

I try to find it in the docs but can't find it.

Any link or direction would be really appreciated.


Cheers

-- 
Welly Tambunan
Triplelands

http://weltam.wordpress.com
http://www.triplelands.com 


Re: Apache Flink Operator State as Query Cache

2015-11-13 Thread Welly Tambunan
Hi Aljoscha,

Thanks for this one. Looking forward for 0.10 release version.

Cheers

On Thu, Nov 12, 2015 at 5:34 PM, Aljoscha Krettek 
wrote:

> Hi,
> I don’t know yet when the operator state will be transitioned to managed
> memory but it could happen for 1.0 (which will come after 0.10). The good
> thing is that the interfaces won’t change, so state can be used as it is
> now.
>
> For 0.10, the release vote is winding down right now, so you can expect
> the release to happen today or tomorrow. I think the streaming is
> production ready now; we expect mostly hardening and some
> infrastructure changes (for example annotations that specify API stability)
> for the 1.0 release.
>
> Let us know if you need more information.
>
> Cheers,
> Aljoscha
> > On 12 Nov 2015, at 02:42, Welly Tambunan  wrote:
> >
> > Hi Stephan,
> >
> > >Storing the model in OperatorState is a good idea, if you can. On the
> roadmap is to migrate the operator state to managed memory as well, so that
> should take care of the GC issues.
> > Is this using off the heap memory ? Which version we expect this one to
> be available ?
> >
> > Another question is when will the release version of 0.10 will be out ?
> We would love to upgrade to that one when it's available. That version will
> be a production ready streaming right ?
> >
> >
> >
> >
> >
> > On Wed, Nov 11, 2015 at 4:49 PM, Stephan Ewen  wrote:
> > Hi!
> >
> > In general, if you can keep state in Flink, you get better
> throughput/latency/consistency and have one less system to worry about
> (external k/v store). State outside means that the Flink processes can be
> slimmer and need fewer resources and as such recover a bit faster. There
> are use cases for that as well.
> >
> > Storing the model in OperatorState is a good idea, if you can. On the
> roadmap is to migrate the operator state to managed memory as well, so that
> should take care of the GC issues.
> >
> > We are just adding functionality to make the Key/Value operator state
> usable in CoMap/CoFlatMap as well (currently it only works in windows and
> in Map/FlatMap/Filter functions over the KeyedStream).
> > Until then, you should be able to use a simple Java HashMap and use the
> "Checkpointed" interface to get it persistent.
> >
> > Greetings,
> > Stephan
> >
> >
> > On Sun, Nov 8, 2015 at 10:11 AM, Welly Tambunan 
> wrote:
> > Thanks for the answer.
> >
> > Currently the approach that i'm using right now is creating a
> base/marker interface to stream different type of message to the same
> operator. Not sure about the performance hit about this compare to the
> CoFlatMap function.
> >
> > Basically this one is providing query cache, so i'm thinking instead of
> using in memory cache like redis, ignite etc, i can just use operator state
> for this one.
> >
> > I just want to gauge do i need to use memory cache or operator state
> would be just fine.
> >
> > However i'm concern about the Gen 2 Garbage Collection for caching our
> own state without using operator state. Is there any clarification on that
> one ?
> >
> >
> >
> > On Sat, Nov 7, 2015 at 12:38 AM, Anwar Rizal 
> wrote:
> >
> > Let me understand your case better here. You have a stream of model and
> stream of data. To process the data, you will need a way to access your
> model from the subsequent stream operations (map, filter, flatmap, ..).
> > I'm not sure in which case Operator State is a good choice, but I think
> you can also live without.
> >
> > val modelStream = ...  // get the model stream
> > val dataStream = ...
> >
> > modelStream.broadcast.connect(dataStream).coFlatMap(...)
> >
> > Then you can keep the latest model in a RichCoFlatMapFunction, not
> > necessarily as Operator State, although maybe OperatorState is a good
> > choice too.
> >
> > Does it make sense to you ?
> >
> > Anwar
> >
> > On Fri, Nov 6, 2015 at 10:21 AM, Welly Tambunan 
> wrote:
> > Hi All,
> >
> > We have a high density data that required a downsample. However this
> downsample model is very flexible based on the client device and user
> interaction. So it will be wasteful to precompute and store to db.
> >
> > So we want to use Apache Flink to do downsampling and cache the result
> for subsequent query.
> >
> > We are considering using Flink Operator state for that one.
> >
> > Is that the right approach to use that for memory cache ? Or if that
> preferable using memory cache like redis etc.
> >
> > Any comments will be appreciated.
> >
> >
> > Cheers
> > --
> > Welly Tambunan
> > Triplelands
> >
> > http://weltam.wordpress.com
> > http://www.triplelands.com
> >
> >
> >
> >
> > --
> > Welly Tambunan
> > Triplelands
> >
> > http://weltam.wordpress.com
> > http://www.triplelands.com
> >
> >
> >
> >
> > --
> > Welly Tambunan
> > Triplelands
> >
> > http://weltam.wordpress.com
> > http://www.triplelands.com
>
>


-- 
Welly Tambunan
Triplelands

http://weltam.wordpress.com
http://www.triplelands.com 
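The broadcast/CoFlatMap suggestion in the thread above boils down to two callbacks sharing one piece of state: one side stores the latest model, the other enriches data records against it. A minimal Flink-free sketch of that shared-state logic; in Flink it would be a RichCoFlatMapFunction, and the scale-factor "model" here is purely illustrative:

```java
// Sketch of the CoFlatMap pattern from the thread: flatMap1 receives
// broadcast model updates, flatMap2 processes data records against the
// latest model. Plain Java stand-in, not actual Flink API code.
public class ModelDataCoFlatMap {
    private double factor = 1.0; // the "model": here just a scale factor

    // Model stream side: remember the newest model.
    public void flatMap1(double newFactor) {
        this.factor = newFactor;
    }

    // Data stream side: apply the current model to each record.
    public double flatMap2(double value) {
        return value * factor;
    }
}
```

Because the model stream is broadcast, every parallel instance of the co-operator sees every model update, so each instance's copy of the model stays current.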


Re: Flink, Kappa and Lambda

2015-11-13 Thread Welly Tambunan
Hi rss rss,

Yes. I have already read that book.

However, given the state of streaming right now and the Kappa Architecture,
do we still need the Lambda Architecture?

Any thoughts?

On Thu, Nov 12, 2015 at 12:29 PM, rss rss  wrote:

> Hello,
>
>   regarding the Lambda architecture there is a following book -
> https://www.manning.com/books/big-data (Big Data. Principles and best
> practices of scalable realtime data systems
>  Nathan Marz and James Warren).
>
> Regards,
> Roman
>
> 2015-11-12 4:47 GMT+03:00 Welly Tambunan :
>
>> Hi Stephan,
>>
>>
>> Thanks for your response.
>>
>>
>> We are trying to justify whether it's enough to use Kappa Architecture
>> with Flink. This more about resiliency and message lost issue etc.
>>
>> The article is worry about message lost even if you are using Kafka.
>>
>> No matter the message queue or broker you rely on whether it be RabbitMQ,
>> JMS, ActiveMQ, Websphere, MSMQ and yes even Kafka you can lose messages in
>> any of the following ways:
>>
>>- A downstream system from the broker can have data loss
>>- All message queues today can lose already acknowledged messages
>>during failover or leader election.
>>- A bug can send the wrong messages to the wrong systems.
>>
>> Cheers
>>
>> On Wed, Nov 11, 2015 at 4:13 PM, Stephan Ewen  wrote:
>>
>>> Hi!
>>>
>>> Can you explain a little more what you want to achieve? Maybe then we
>>> can give a few more comments...
>>>
>>> I briefly read through some of the articles you linked, but did not
>>> quite understand their train of thoughts.
>>> For example, letting Tomcat write to Cassandra directly, and to Kafka,
>>> might just be redundant. Why not let the streaming job that reads the Kafka
>>> queue
>>> move the data to Cassandra as one of its results? Further more, durable
>>> storing the sequence of events is exactly what Kafka does, but the article
>>> suggests to use Cassandra for that, which I find very counter intuitive.
>>> It looks a bit like the suggested approach is only adopting streaming for
>>> half the task.
>>>
>>> Greetings,
>>> Stephan
>>>
>>>
>>> On Tue, Nov 10, 2015 at 7:49 AM, Welly Tambunan 
>>> wrote:
>>>
 Hi All,

 I read a couple of article about Kappa and Lambda Architecture.


 http://www.confluent.io/blog/real-time-stream-processing-the-next-step-for-apache-flink/

 I'm convince that Flink will simplify this one with streaming.

 However i also stumble upon this blog post that has valid argument to
 have a system of record storage ( event sourcing ) and finally lambda
 architecture is appear at the solution. Basically it will write twice to
 Queuing system and C* for safety. System of record here is basically
 storing the event (delta).

 [image: Inline image 1]


 https://lostechies.com/ryansvihla/2015/09/17/event-sourcing-and-system-of-record-sane-distributed-development-in-the-modern-era-2/

 Another approach is about lambda architecture for maintaining the
 correctness of the system.


 https://lostechies.com/ryansvihla/2015/09/17/real-time-analytics-with-spark-streaming-and-cassandra/


 Given that he's using Spark for the streaming processor, do we have to
 do the same thing with Apache Flink ?



 Cheers
 --
 Welly Tambunan
 Triplelands

 http://weltam.wordpress.com
 http://www.triplelands.com 

>>>
>>>
>>
>>
>> --
>> Welly Tambunan
>> Triplelands
>>
>> http://weltam.wordpress.com
>> http://www.triplelands.com 
>>
>
>


-- 
Welly Tambunan
Triplelands

http://weltam.wordpress.com
http://www.triplelands.com 


Re: Join Stream with big ref table

2015-11-13 Thread Robert Metzger
Hi Arnaud,

I'm happy that you were able to resolve the issue. If you are still
interested in the first approach, you could try some things, for example
using only one slot per task manager (the slots share the heap of the TM).

Regards,
Robert
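Robert's first suggestion (one slot per TaskManager, so a single task gets the TM's whole heap) is a flink-conf.yaml setting. A sketch, with an illustrative heap size:

```yaml
# flink-conf.yaml: give each TaskManager a single slot so one task owns
# the whole TM heap; start more TaskManagers to keep the same parallelism.
taskmanager.numberOfTaskSlots: 1
taskmanager.heap.mb: 12288   # illustrative heap size per TM
```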

On Fri, Nov 13, 2015 at 9:18 AM, LINZ, Arnaud 
wrote:

> Hello,
>
>
>
> I’ve worked around my problem by not using the HiveServer2 JDBC driver to
> read the ref table. Apparently, despite all the good options passed to the
> Statement object, it handles RAM poorly: converting the table to text
> format and reading it directly from HDFS works without any problem, with
> plenty of free memory left.
>
>
>
> Greetings,
>
> Arnaud
>
>
>
> *De :* LINZ, Arnaud
> *Envoyé :* jeudi 12 novembre 2015 17:48
> *À :* 'user@flink.apache.org' 
> *Objet :* Join Stream with big ref table
>
>
>
> Hello,
>
>
>
> I have to enrich a stream with a big reference table (11,000,000 rows). I
> cannot use “join” because I cannot window the stream; so in the “open()”
> function of each mapper I read the content of the table and put it in a
> HashMap (stored on the heap).
>
> 11M rows is quite big, but it should take less than 100 MB in RAM, so it’s
> supposed to be easy. However, I systematically run into a Java Out Of
> Memory error, even with huge 64 GB containers (5 slots / container).
>
>
>
> Path, ID:             akka.tcp://flink@172.21.125.28:43653/user/taskmanager
>                       (4B4D0A725451E933C39E891AAE80B53B)
> Data Port:            41982
> Last Heartbeat:       2015-11-12, 17:46:14
> All Slots:            5
> Free Slots:           5
> CPU Cores:            32
> Physical Memory:      126.0 GB
> Free Memory:          46.0 GB
> Flink Managed Memory: 31.5 GB
>
>
>
> I don’t clearly understand why this happens and how to fix it. Any clue?
>
>
>
>
>
>
>
>


Re: Flink, Kappa and Lambda

2015-11-13 Thread Welly Tambunan
Hi Nick,

I totally agree with your point.

My concern is about Kafka: is the author's concern really true? Can anyone
give comments on this one?



On Thu, Nov 12, 2015 at 12:33 PM, Nick Dimiduk  wrote:

> The first and 3rd points here aren't very fair -- they apply to all data
> systems. Systems downstream of your database can lose data in the same way;
> the database retention policy expires old data, downstream fails, and back
> to the tapes you must go. Likewise with 3, a bug in any ETL system can
> cause problems. Also not specific to streaming in general or Kafka/Flink
> specifically.
>
> I'm much more curious about the 2nd claim. The whole point of high
> availability in these systems is to not lose data during failure. The
> post's author is not specific on any of these points, but just like I look
> to a distributed database community to prove to me it doesn't lose data in
> these corner cases, so too do I expect Kafka to prove it is resilient. In
> the absence of software formally proven correct, I look to empirical
> evidence in the form of chaos monkey type tests.
>
>
> On Wednesday, November 11, 2015, Welly Tambunan  wrote:
>
>> Hi Stephan,
>>
>>
>> Thanks for your response.
>>
>>
>> We are trying to justify whether it's enough to use Kappa Architecture
>> with Flink. This more about resiliency and message lost issue etc.
>>
>> The article is worry about message lost even if you are using Kafka.
>>
>> No matter the message queue or broker you rely on whether it be RabbitMQ,
>> JMS, ActiveMQ, Websphere, MSMQ and yes even Kafka you can lose messages in
>> any of the following ways:
>>
>>- A downstream system from the broker can have data loss
>>- All message queues today can lose already acknowledged messages
>>during failover or leader election.
>>- A bug can send the wrong messages to the wrong systems.
>>
>> Cheers
>>
>> On Wed, Nov 11, 2015 at 4:13 PM, Stephan Ewen  wrote:
>>
>>> Hi!
>>>
>>> Can you explain a little more what you want to achieve? Maybe then we
>>> can give a few more comments...
>>>
>>> I briefly read through some of the articles you linked, but did not
>>> quite understand their train of thoughts.
>>> For example, letting Tomcat write to Cassandra directly, and to Kafka,
>>> might just be redundant. Why not let the streaming job that reads the Kafka
>>> queue
>>> move the data to Cassandra as one of its results? Further more, durable
>>> storing the sequence of events is exactly what Kafka does, but the article
>>> suggests to use Cassandra for that, which I find very counter intuitive.
>>> It looks a bit like the suggested approach is only adopting streaming for
>>> half the task.
>>>
>>> Greetings,
>>> Stephan
>>>
>>>
>>> On Tue, Nov 10, 2015 at 7:49 AM, Welly Tambunan 
>>> wrote:
>>>
 Hi All,

 I read a couple of article about Kappa and Lambda Architecture.


 http://www.confluent.io/blog/real-time-stream-processing-the-next-step-for-apache-flink/

 I'm convince that Flink will simplify this one with streaming.

 However i also stumble upon this blog post that has valid argument to
 have a system of record storage ( event sourcing ) and finally lambda
 architecture is appear at the solution. Basically it will write twice to
 Queuing system and C* for safety. System of record here is basically
 storing the event (delta).

 [image: Inline image 1]


 https://lostechies.com/ryansvihla/2015/09/17/event-sourcing-and-system-of-record-sane-distributed-development-in-the-modern-era-2/

 Another approach is about lambda architecture for maintaining the
 correctness of the system.


 https://lostechies.com/ryansvihla/2015/09/17/real-time-analytics-with-spark-streaming-and-cassandra/


 Given that he's using Spark for the streaming processor, do we have to
 do the same thing with Apache Flink ?



 Cheers
 --
 Welly Tambunan
 Triplelands

 http://weltam.wordpress.com
 http://www.triplelands.com 

>>>
>>>
>>
>>
>> --
>> Welly Tambunan
>> Triplelands
>>
>> http://weltam.wordpress.com
>> http://www.triplelands.com 
>>
>


-- 
Welly Tambunan
Triplelands

http://weltam.wordpress.com
http://www.triplelands.com 


Re: Apache Flink Operator State as Query Cache

2015-11-13 Thread Robert Metzger
Hi Welly,
Flink 0.10.0 is out, it's just not announced yet.
It's available on Maven Central, and the global mirrors are currently syncing
it. This mirror, for example, already has the update:
http://apache.mirror.digionline.de/flink/flink-0.10.0/

On Fri, Nov 13, 2015 at 9:50 AM, Welly Tambunan  wrote:

> Hi Aljoscha,
>
> Thanks for this one. Looking forward for 0.10 release version.
>
> Cheers
>
> On Thu, Nov 12, 2015 at 5:34 PM, Aljoscha Krettek 
> wrote:
>
>> Hi,
>> I don’t know yet when the operator state will be transitioned to managed
>> memory but it could happen for 1.0 (which will come after 0.10). The good
>> thing is that the interfaces won’t change, so state can be used as it is
>> now.
>>
>> For 0.10, the release vote is winding down right now, so you can expect
>> the release to happen today or tomorrow. I think the streaming is
>> production ready now; we expect mostly hardening and some
>> infrastructure changes (for example annotations that specify API stability)
>> for the 1.0 release.
>>
>> Let us know if you need more information.
>>
>> Cheers,
>> Aljoscha
>> > On 12 Nov 2015, at 02:42, Welly Tambunan  wrote:
>> >
>> > Hi Stephan,
>> >
>> > >Storing the model in OperatorState is a good idea, if you can. On the
>> roadmap is to migrate the operator state to managed memory as well, so that
>> should take care of the GC issues.
>> > Is this using off the heap memory ? Which version we expect this one to
>> be available ?
>> >
>> > Another question is when will the release version of 0.10 will be out ?
>> We would love to upgrade to that one when it's available. That version will
>> be a production ready streaming right ?
>> >
>> >
>> >
>> >
>> >
>> > On Wed, Nov 11, 2015 at 4:49 PM, Stephan Ewen  wrote:
>> > Hi!
>> >
>> > In general, if you can keep state in Flink, you get better
>> throughput/latency/consistency and have one less system to worry about
>> (external k/v store). State outside means that the Flink processes can be
>> slimmer and need fewer resources and as such recover a bit faster. There
>> are use cases for that as well.
>> >
>> > Storing the model in OperatorState is a good idea, if you can. On the
>> roadmap is to migrate the operator state to managed memory as well, so that
>> should take care of the GC issues.
>> >
>> > We are just adding functionality to make the Key/Value operator state
>> usable in CoMap/CoFlatMap as well (currently it only works in windows and
>> in Map/FlatMap/Filter functions over the KeyedStream).
>> > Until then, you should be able to use a simple Java HashMap and use the
>> "Checkpointed" interface to get it persistent.
>> >
>> > Greetings,
>> > Stephan
>> >
>> >
>> > On Sun, Nov 8, 2015 at 10:11 AM, Welly Tambunan 
>> wrote:
>> > Thanks for the answer.
>> >
>> > Currently the approach that i'm using right now is creating a
>> base/marker interface to stream different type of message to the same
>> operator. Not sure about the performance hit about this compare to the
>> CoFlatMap function.
>> >
>> > Basically this one is providing query cache, so i'm thinking instead of
>> using in memory cache like redis, ignite etc, i can just use operator state
>> for this one.
>> >
>> > I just want to gauge do i need to use memory cache or operator state
>> would be just fine.
>> >
>> > However i'm concern about the Gen 2 Garbage Collection for caching our
>> own state without using operator state. Is there any clarification on that
>> one ?
>> >
>> >
>> >
>> > On Sat, Nov 7, 2015 at 12:38 AM, Anwar Rizal 
>> wrote:
>> >
>> > Let me understand your case better here. You have a stream of model and
>> stream of data. To process the data, you will need a way to access your
>> model from the subsequent stream operations (map, filter, flatmap, ..).
>> > I'm not sure in which case Operator State is a good choice, but I think
>> you can also live without.
>> >
>> > val modelStream = ...  // get the model stream
>> > val dataStream = ...
>> >
>> > modelStream.broadcast.connect(dataStream).coFlatMap(...)
>> >
>> > Then you can keep the latest model in a RichCoFlatMapFunction, not
>> > necessarily as Operator State, although maybe OperatorState is a good
>> > choice too.
>> >
>> > Does it make sense to you ?
>> >
>> > Anwar
>> >
>> > On Fri, Nov 6, 2015 at 10:21 AM, Welly Tambunan 
>> wrote:
>> > Hi All,
>> >
>> > We have a high density data that required a downsample. However this
>> downsample model is very flexible based on the client device and user
>> interaction. So it will be wasteful to precompute and store to db.
>> >
>> > So we want to use Apache Flink to do downsampling and cache the result
>> for subsequent query.
>> >
>> > We are considering using Flink Operator state for that one.
>> >
>> > Is that the right approach to use that for memory cache ? Or if that
>> preferable using memory cache like redis etc.
>> >
>> > Any comments will be appreciated.
>> >
>> >
>> > Cheers
>> > --
>> > Welly Tambunan
>> > Triplelands
>> >
>> > http://weltam.wordpress.com
>> > http://www

Re: Apache Flink Operator State as Query Cache

2015-11-13 Thread Welly Tambunan
Awesome !

This is really the best weekend gift ever. :)

Cheers

On Fri, Nov 13, 2015 at 3:54 PM, Robert Metzger  wrote:

> Hi Welly,
> Flink 0.10.0 is out, it's just not announced yet.
> It's available on Maven Central, and the global mirrors are currently
> syncing it. This mirror, for example, already has the update:
> http://apache.mirror.digionline.de/flink/flink-0.10.0/
>
> On Fri, Nov 13, 2015 at 9:50 AM, Welly Tambunan  wrote:
>
>> Hi Aljoscha,
>>
>> Thanks for this one. Looking forward for 0.10 release version.
>>
>> Cheers
>>
>> On Thu, Nov 12, 2015 at 5:34 PM, Aljoscha Krettek 
>> wrote:
>>
>>> Hi,
>>> I don’t know yet when the operator state will be transitioned to managed
>>> memory but it could happen for 1.0 (which will come after 0.10). The good
>>> thing is that the interfaces won’t change, so state can be used as it is
>>> now.
>>>
>>> For 0.10, the release vote is winding down right now, so you can expect
>>> the release to happen today or tomorrow. I think the streaming is
>>> production ready now; we expect mostly hardening and some
>>> infrastructure changes (for example annotations that specify API stability)
>>> for the 1.0 release.
>>>
>>> Let us know if you need more information.
>>>
>>> Cheers,
>>> Aljoscha
>>> > On 12 Nov 2015, at 02:42, Welly Tambunan  wrote:
>>> >
>>> > Hi Stephan,
>>> >
>>> > >Storing the model in OperatorState is a good idea, if you can. On the
>>> roadmap is to migrate the operator state to managed memory as well, so that
>>> should take care of the GC issues.
>>> > Is this using off the heap memory ? Which version we expect this one
>>> to be available ?
>>> >
>>> > Another question is when will the release version of 0.10 will be out
>>> ? We would love to upgrade to that one when it's available. That version
>>> will be a production ready streaming right ?
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Wed, Nov 11, 2015 at 4:49 PM, Stephan Ewen 
>>> wrote:
>>> > Hi!
>>> >
>>> > In general, if you can keep state in Flink, you get better
>>> throughput/latency/consistency and have one less system to worry about
>>> (external k/v store). State outside means that the Flink processes can be
>>> slimmer and need fewer resources and as such recover a bit faster. There
>>> are use cases for that as well.
>>> >
>>> > Storing the model in OperatorState is a good idea, if you can. On the
>>> roadmap is to migrate the operator state to managed memory as well, so that
>>> should take care of the GC issues.
>>> >
>>> > We are just adding functionality to make the Key/Value operator state
>>> usable in CoMap/CoFlatMap as well (currently it only works in windows and
>>> in Map/FlatMap/Filter functions over the KeyedStream).
>>> > Until then, you should be able to use a simple Java HashMap and use the
>>> "Checkpointed" interface to get it persistent.
>>> >
>>> > Greetings,
>>> > Stephan
>>> >
>>> >
>>> > On Sun, Nov 8, 2015 at 10:11 AM, Welly Tambunan 
>>> wrote:
>>> > Thanks for the answer.
>>> >
>>> > Currently the approach that i'm using right now is creating a
>>> base/marker interface to stream different type of message to the same
>>> operator. Not sure about the performance hit about this compare to the
>>> CoFlatMap function.
>>> >
>>> > Basically this one is providing query cache, so i'm thinking instead
>>> of using in memory cache like redis, ignite etc, i can just use operator
>>> state for this one.
>>> >
>>> > I just want to gauge do i need to use memory cache or operator state
>>> would be just fine.
>>> >
>>> > However i'm concern about the Gen 2 Garbage Collection for caching our
>>> own state without using operator state. Is there any clarification on that
>>> one ?
>>> >
>>> >
>>> >
>>> > On Sat, Nov 7, 2015 at 12:38 AM, Anwar Rizal 
>>> wrote:
>>> >
>>> > Let me understand your case better here. You have a stream of model
>>> and stream of data. To process the data, you will need a way to access your
>>> model from the subsequent stream operations (map, filter, flatmap, ..).
>>> > I'm not sure in which case Operator State is a good choice, but I
>>> think you can also live without.
>>> >
>>> > val modelStream = ...  // get the model stream
>>> > val dataStream = ...
>>> >
>>> > modelStream.broadcast.connect(dataStream).coFlatMap(...)
>>> >
>>> > Then you can keep the latest model in a RichCoFlatMapFunction, not
>>> > necessarily as Operator State, although maybe OperatorState is a good
>>> > choice too.
>>> >
>>> > Does it make sense to you ?
>>> >
>>> > Anwar
>>> >
>>> > On Fri, Nov 6, 2015 at 10:21 AM, Welly Tambunan 
>>> wrote:
>>> > Hi All,
>>> >
>>> > We have a high density data that required a downsample. However this
>>> downsample model is very flexible based on the client device and user
>>> interaction. So it will be wasteful to precompute and store to db.
>>> >
>>> > So we want to use Apache Flink to do downsampling and cache the result
>>> for subsequent query.
>>> >
>>> > We are considering using Flink Operator state for that one.
>>> >
>>> > Is that the right a

Hello World Flink 0.10

2015-11-13 Thread Kamil Gorlo
Hi guys,

I was trying to implement Hello World from slide 36 from
http://www.slideshare.net/FlinkForward/k-tzoumas-s-ewen-flink-forward-keynote

but I have problem with EOFTrigger - is it something I should implement by
myself? I cannot find it in Flink libraries.

Cheers,
Kamil


Re: Hello World Flink 0.10

2015-11-13 Thread Robert Metzger
Hi Kamil,

The EOFTrigger is not part of Flink.
However, I've also tried implementing the Hello World from the presentation
here:
https://github.com/rmetzger/scratch/blob/flink0.10-scala2.11/src/main/scala/com/dataartisans/Job.scala

Stephan Ewen told me that there is a more elegant way of implementing this:
The trigger is actually not needed because windows trigger when they are
closed (and they are closed when the file has been read). So you can also
remove my hack of emitting a fake record at the end of the read process in
the close() method.



On Fri, Nov 13, 2015 at 10:52 AM, Kamil Gorlo  wrote:

> Hi guys,
>
> I was trying to implement Hello World from slide 36 from
> http://www.slideshare.net/FlinkForward/k-tzoumas-s-ewen-flink-forward-keynote
>
> but I have problem with EOFTrigger - is it something I should implement by
> myself? I cannot find it in Flink libraries.
>
> Cheers,
> Kamil
>


Re: Multilang Support on Flink

2015-11-13 Thread Maximilian Michels
Hi Welly,

There is a protocol for communicating with other processes. This is
reflected in the flink-language-binding-generic module. I'm not aware of
how Spark's or Storm's communication protocols work, but this protocol is
rather low-level.
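For context, multilang mechanisms like Storm's (and pipeTo in Spark, per the question above) are at heart line-oriented stdin/stdout protocols with an external process; Flink's language-binding module is a lower-level variant of the same idea. A minimal sketch of such a line protocol loop, with an uppercase transformation as a stand-in for the external worker's logic (this is not Flink's actual protocol):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

// Minimal line-oriented "multilang"-style loop: read one record per line,
// transform it, write one result per line. Real protocols (Storm multilang,
// Flink's language-binding module) add framing, acks and serialization.
public class LineProtocolLoop {
    public static void run(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            out.write(line.toUpperCase() + "\n"); // stand-in transformation
        }
        out.flush();
    }
}
```

In a real deployment the Reader/Writer would be wired to the stdin/stdout of a child process started with ProcessBuilder.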

Cheers,
Max

On Fri, Nov 13, 2015 at 9:49 AM, Welly Tambunan  wrote:
> Hi All,
>
> I want to ask if there's multilang support ( like in Storm and pipeTo in
> Spark ) in flink ?
>
> I try to find it in the docs but can't find it.
>
> Any link or direction would be really appreciated.
>
>
> Cheers
>
> --
> Welly Tambunan
> Triplelands
>
> http://weltam.wordpress.com
> http://www.triplelands.com


Apache Flink Forward Videos

2015-11-13 Thread Welly Tambunan
Hi All,

I've just noticed that the videos are already available for this one.

http://flink-forward.org/?post_type=session


Another weekend gift for all.

Cheers
-- 
Welly Tambunan
Triplelands

http://weltam.wordpress.com
http://www.triplelands.com 


Re: Apache Flink Forward Videos

2015-11-13 Thread Maximilian Michels
Hi Welly,

Thanks for sharing! The videos are coming; they will all be available soon.

Cheers,
Max

On Fri, Nov 13, 2015 at 11:08 AM, Welly Tambunan  wrote:
> Hi All,
>
> I've just noticed that the videos are already available for this one.
>
> http://flink-forward.org/?post_type=session
>
>
> Another weekend gift for all.
>
> Cheers
> --
> Welly Tambunan
> Triplelands
>
> http://weltam.wordpress.com
> http://www.triplelands.com


Re: Hello World Flink 0.10

2015-11-13 Thread Kamil Gorlo
OK, so as far as I understand, the line ".trigger(new EOFTrigger)" is simply
not needed in this case (it looks like it works as expected when I remove
it).

Thanks for your help.

pt., 13.11.2015 o 11:01 użytkownik Robert Metzger 
napisał:

> Hi Kamil,
>
> The EOFTrigger is not part of Flink.
> However, I've also tried implementing the Hello World from the
> presentation here:
> https://github.com/rmetzger/scratch/blob/flink0.10-scala2.11/src/main/scala/com/dataartisans/Job.scala
>
> Stephan Ewen told me that there is a more elegant way of implementing
> this: The trigger is actually not needed because windows trigger when they
> are closed (and they are closed when the file has been read). So you can
> also remove my hack of emitting a fake record at the end of the read
> process in the close() method.
>
>
>
> On Fri, Nov 13, 2015 at 10:52 AM, Kamil Gorlo  wrote:
>
>> Hi guys,
>>
>> I was trying to implement Hello World from slide 36 from
>> http://www.slideshare.net/FlinkForward/k-tzoumas-s-ewen-flink-forward-keynote
>>
>> but I have problem with EOFTrigger - is it something I should implement
>> by myself? I cannot find it in Flink libraries.
>>
>> Cheers,
>> Kamil
>>
>
>


Re: Hello World Flink 0.10

2015-11-13 Thread Robert Metzger
Hi,
yes, you can remove the line with the trigger.
Here is a version of the code without the trigger:
https://github.com/rmetzger/scratch/blob/flink0.10-scala2.11/src/main/scala/com/dataartisans/JobWithoutTrigger.scala

On Fri, Nov 13, 2015 at 11:29 AM, Kamil Gorlo  wrote:

> OK, so as far as I understand the line ".trigger(new EOFTrigger)" is
> simply not needed in this case (it looks that it works as expected when I
> remove it).
>
> Thanks for your help.
>
> On Fri, 13 Nov 2015 at 11:01, Robert Metzger wrote:
>
>> Hi Kamil,
>>
>> The EOFTrigger is not part of Flink.
>> However, I've also tried implementing the Hello World from the
>> presentation here:
>> https://github.com/rmetzger/scratch/blob/flink0.10-scala2.11/src/main/scala/com/dataartisans/Job.scala
>>
>> Stephan Ewen told me that there is a more elegant way of implementing
>> this: The trigger is actually not needed because windows trigger when they
>> are closed (and they are closed when the file has been read). So you can
>> also remove my hack of emitting a fake record at the end of the read
>> process in the close() method.
>>
>>
>>
>> On Fri, Nov 13, 2015 at 10:52 AM, Kamil Gorlo  wrote:
>>
>>> Hi guys,
>>>
>>> I was trying to implement Hello World from slide 36 from
>>> http://www.slideshare.net/FlinkForward/k-tzoumas-s-ewen-flink-forward-keynote
>>>
>>> but I have problem with EOFTrigger - is it something I should implement
>>> by myself? I cannot find it in Flink libraries.
>>>
>>> Cheers,
>>> Kamil
>>>
>>
>>


Re: Flink, Kappa and Lambda

2015-11-13 Thread Christian Kreutzfeldt
Hi

Personally, I find the concepts of the so-called Kappa
architecture intriguing. But I doubt that it is applicable in a generic
setup where different use cases are mapped into the architecture. To be
fair, I think the same applies to Lambda architectures. Therefore I
wouldn't assume that Lambda architectures are obsolete with the advent of
Kappa as new architectural paradigm.

From my point of view, it all depends on the use case that you want to
solve. For example, I saw a presentation given by Eric Tschetter and
Fangjin Yang of MetaMarkets on how they use Hadoop and Druid to drive their
business. They used Hadoop as long-term storage and Druid on the serving
layer to provide up-to-date data into the business by updating it in
sub-second intervals. Regularly they align both systems to be consistent.
In their case, the Lambda architecture serves their business quite well:
speed achieved through the streaming layer and long-term persistence
through the batch layer.

In cases where you - for example - want to create views on customer
sessions by aggregating all events belonging to a single person and use
them to

* serve recommendation systems while the customer is still on your website
and
* keep them persistent in a long-term archive

people tend to build typical Lambda architectures with duplicated
sessionizing code on both layers. From my point of view this is unnecessary
and introduces an additional source of errors. As customer sessions are
created as stream of events, simply implement the logic on your streaming
layer and persist the final session after a timeout in those systems where
you need the data to be present: e.g. the recommender system receives constant
updates on each new event, and the batch layer (Hadoop) receives the
finished session after it has timed out.
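As a sketch, the single-layer sessionizing just described could look like this — a minimal Python sketch under assumptions: a fixed inactivity gap, timestamp-ordered events, and two caller-supplied callbacks standing in for the recommender update and the batch archive:

```python
SESSION_GAP = 30 * 60  # assumed 30-minute inactivity gap

def sessionize(events, on_event, on_session_end):
    """events: iterable of (user, timestamp, payload), ordered by timestamp.

    on_event fires on every new event (the recommender's constant updates);
    on_session_end fires once per finished session (the batch layer's input).
    """
    open_sessions = {}  # user -> {"start": ts, "last": ts, "events": [...]}
    for user, ts, payload in events:
        # Close every session whose inactivity gap expired before this event.
        expired = [u for u, s in open_sessions.items() if ts - s["last"] > SESSION_GAP]
        for u in expired:
            on_session_end(u, open_sessions.pop(u))
        s = open_sessions.setdefault(user, {"start": ts, "last": ts, "events": []})
        s["last"] = ts
        s["events"].append(payload)
        on_event(user, s)
    for u in list(open_sessions):  # flush remaining sessions at end of input
        on_session_end(u, open_sessions.pop(u))
```

Both downstream systems consume the output of the same sessionizing code, so there is no duplicated logic to keep consistent across a speed layer and a batch layer.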

As Lambda - in most cases - is implemented to do the same thing on both
layers, later merging the results to keep states consistent, the Kappa
architecture introduces an interesting pattern that people often are not
aware of. The idea to persist the stream itself and get rid of other
systems, like RDBMS, NoSQL DBs or any other type of archive software, is
often accepted as cheap way to reduce costs and maintenance efforts.

But I think Kappa does more and may be expanded to other systems than
streaming as well. You keep the data at that system persistent where it
arrived or received a state you expect in subsequent systems. Why should I
convert a stream of tracking events into a static schema and store the data
inside an RDBMS? What if I rely on its nature that data is coming in as
stream and do not want to have it exported/imported as bulk update but have
the same stream replayed later? What about information loss? Being a stream
of events is part of the information as well like the attributes each event
carries.

So, if Kappa is understood as architectural pattern where data is kept and
processed the way it arrived or is expected by subsequent systems, I do not
think that it will ever replace Lambda but it will complement it.

Therefore I would like to give you the advice to look at your use case(s)
and design the architecture as you need it. Do not stick with a certain
pattern, but deploy those parts that fit your use case. This field is far
too young for strictly following a certain pattern to provide additional
value by itself, e.g. by making it easier to integrate with third-party
software.

Best
  Christian


2015-11-13 9:51 GMT+01:00 Welly Tambunan :

> Hi rss rss,
>
> Yes. I have already read that book.
>
> However given the state of streaming right now, and Kappa Architecture, I
> don't think we need Lambda Architecture again ?
>
> Any thoughts ?
>
> On Thu, Nov 12, 2015 at 12:29 PM, rss rss  wrote:
>
>> Hello,
>>
>>   regarding the Lambda architecture there is a following book -
>> https://www.manning.com/books/big-data (Big Data. Principles and best
>> practices of scalable realtime data systems
>>  Nathan Marz and James Warren).
>>
>> Regards,
>> Roman
>>
>> 2015-11-12 4:47 GMT+03:00 Welly Tambunan :
>>
>>> Hi Stephan,
>>>
>>>
>>> Thanks for your response.
>>>
>>>
>>> We are trying to justify whether it's enough to use Kappa Architecture
>>> with Flink. This more about resiliency and message lost issue etc.
>>>
>>> The article is worry about message lost even if you are using Kafka.
>>>
>>> No matter the message queue or broker you rely on whether it be
>>> RabbitMQ, JMS, ActiveMQ, Websphere, MSMQ and yes even Kafka you can lose
>>> messages in any of the following ways:
>>>
>>>- A downstream system from the broker can have data loss
>>>- All message queues today can lose already acknowledged messages
>>>during failover or leader election.
>>>- A bug can send the wrong messages to the wrong systems.
>>>
>>> Cheers
>>>
>>> On Wed, Nov 11, 2015 at 4:13 PM, Stephan Ewen  wrote:
>>>
 Hi!

 Can you explain a little more what you want to achieve? Maybe then we
 can give a few more comments

RE: Crash in a simple "mapper style" streaming app likely due to a memory leak ?

2015-11-13 Thread LINZ, Arnaud
Hi Robert,

Thanks, it works with 50% -- at least way past the previous crash point.

In my opinion (I lack real metrics), the part that uses the most memory is the
M2 mapper, instantiated once per slot.
The most complex part is the Sink (it does use a lot of HDFS files, flushing
threads, etc.); but I expect the “RichSinkFunction” to be instantiated only
once per slot? I’m really surprised by that memory usage; I will try using a
monitoring app on the YARN JVM to understand.

How do I set this yarn.heap-cutoff-ratio parameter for a specific application?
I don’t want to modify the “root-protected” flink-conf.yaml for all users &
Flink jobs.

Regards,
Arnaud

From: Robert Metzger [mailto:rmetz...@apache.org]
Sent: Friday, November 13, 2015 15:16
To: user@flink.apache.org
Subject: Re: Crash in a simple "mapper style" streaming app likely due to a
memory leak ?

Hi Arnaud,

can you try running the job again with the configuration value of 
"yarn.heap-cutoff-ratio" set to 0.5.
As you can see, the container has been killed because it used more than 12 GB : 
"12.1 GB of 12 GB physical memory used;"
You can also see from the logs, that we limit the JVM Heap space to 9.2GB: 
"java -Xms9216m -Xmx9216m"

In an ideal world, we would tell the JVM to limit its memory usage to 12 GB,
but sadly, heap space is not the only memory the JVM allocates: it also
allocates direct memory and other things off-heap. Therefore, we assign only
75% of the container memory to the heap.
In your case, I assume each JVM has multiple HDFS clients, a lot of local
threads, etc.; that's why the memory might not suffice.
With a cutoff ratio of 0.5, we'll only use 6 GB for the heap.

That value might be a bit too high, but I want to make sure that we first
identify the issue.
If the job runs with the 50% cutoff, you can try to reduce it again towards
25% (that's the default value, contrary to what the documentation says).
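The cutoff arithmetic itself is simple proportionality — a small sketch (the function name is illustrative; the actual Flink code may also apply an absolute minimum cutoff, but the proportional part matches the numbers above):

```python
def yarn_jvm_heap_mb(container_mb, heap_cutoff_ratio=0.25):
    """Heap handed to the JVM after reserving the cutoff share of the
    container for off-heap memory (direct buffers, threads, etc.)."""
    return int(container_mb * (1 - heap_cutoff_ratio))

# A 12 GB container with the default 0.25 cutoff gives the 9216 MB heap
# seen in the logs ("java -Xms9216m -Xmx9216m"); a 0.5 cutoff gives 6 GB.
assert yarn_jvm_heap_mb(12 * 1024) == 9216
assert yarn_jvm_heap_mb(12 * 1024, 0.5) == 6144
```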

I hope that helps.

Regards,
Robert


On Fri, Nov 13, 2015 at 2:58 PM, LINZ, Arnaud wrote:
Hello,

I use the brand new 0.10 version and I have problems running a streaming
execution. My topology is linear: a custom source SC scans a directory and
emits HDFS file names; a first mapper M1 opens the files and emits their
lines; a filter F filters lines; another mapper M2 transforms them; and a
mapper/sink M3->SK stores them in HDFS.

SC->M1->F->M2->M3->SK

The M2 transformer uses a fair amount of RAM: when it opens, it loads an
11M-row static table into a hash map to enrich the lines. I use 55 slots on
YARN, in 11 containers of 12 GB × 5 slots each.

To my understanding, I should not have any memory problem since each record is
independent: no join, no key, no aggregation, no window => it’s a simple flow
mapper, with RAM simply used as a buffer. However, if I submit enough input
data, I systematically crash my app with a “Connection unexpectedly closed by
remote task manager” exception, and the first error in the YARN log shows that
“a container is running beyond physical memory limits”.

If I increase the container size, I simply need to feed in more data to make
the crash happen.

Any idea?

Greetings,
Arnaud

_
Exceptions in Flink dashboard detail :

Root Exception :
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: 
Connection unexpectedly closed by remote task manager 
'bt1shli6/172.21.125.31:33186'. This might indicate 
that the remote task manager was lost.
   at 
org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:119)
(…)



L'intégrité de ce message n'étant pas assurée sur internet, la société 
expéditrice ne peut être tenue responsable de son contenu ni de ses pièces 
jointes. Toute utilisation ou diffusion non autorisée est interdite. Si vous 
n'êtes pas destinataire de ce message, merci de le détruire et d'avertir 
l'expéditeur.

The integrity of this message cannot be guaranteed on the Internet. The company 
that sent this message cannot therefore be held liable for its content nor 
attachments. Any unauthorized use or dissemination is prohibited. If you are 
not the intended recipient of this message, then please delete it and notify 
the sender.


Re: Crash in a simple "mapper style" streaming app likely due to a memory leak ?

2015-11-13 Thread Robert Metzger
Hi Arnaud,

your M2 mapper is allocating memory on the JVM heap. That should not cause
any issues, because the heap is limited to 9.2GB anyway. The problem is
off-heap allocations.
The RichSinkFunction is only instantiated once per slot, yes.

You can set application-specific parameters using
"-Dyarn.heap-cutoff-ratio=0.5".

Regards,
Robert





On Fri, Nov 13, 2015 at 3:49 PM, LINZ, Arnaud 
wrote:

> Hi Robert,
>
>
>
> Thanks, it works with 50% -- at least way past the previous crash point.
>
>
>
> In my opinion (I lack real metrics), the part that uses the most memory is
> the M2 mapper, instantiated once per slot.
>
> The most complex part is the Sink (it does use a lot of hdfs files,
> flushing threads etc.) ; but I expect the “RichSinkFunction” to be
> instantiated only once per slot ? I’m really surprised by that memory
> usage, I will try using a monitoring app on the yarn jvm to understand.
>
>
>
> How do I set this yarn.heap-cutoff-ratio  parameter for a specific
> application ? I don’t want to modify the “root-protected” flink-conf.yaml
> for all the users & flink jobs with that value.
>
>
>
> Regards,
>
> Arnaud
>
>
>
> *De :* Robert Metzger [mailto:rmetz...@apache.org]
> *Envoyé :* vendredi 13 novembre 2015 15:16
> *À :* user@flink.apache.org
> *Objet :* Re: Crash in a simple "mapper style" streaming app likely due
> to a memory leak ?
>
>
>
> Hi Arnaud,
>
>
>
> can you try running the job again with the configuration value
> of "yarn.heap-cutoff-ratio" set to 0.5.
>
> As you can see, the container has been killed because it used more than 12
> GB : "12.1 GB of 12 GB physical memory used;"
> You can also see from the logs, that we limit the JVM Heap space to 9.2GB:
> "java -Xms9216m -Xmx9216m"
>
>
>
> In an ideal world, we would tell the JVM to limit its memory usage to 12
> GB, but sadly, the heap space is not the only memory the JVM is allocating.
> Its allocating direct memory, and other stuff outside. Therefore, we use
> only 75% of the container memory to the heap.
>
> In your case, I assume that each JVM is having multiple HDFS clients, a
> lot of local threads etc that's why the memory might not suffice.
>
> With a cutoff ratio of 0.5, we'll only use 6 GB for the heap.
>
>
>
> That value might be a bit too high .. but I want to make sure that we
> first identify the issue.
>
> If the job is running with 50% cutoff, you can try to reduce it again
> towards 25% (that's the default value, unlike the documentation says).
>
>
>
> I hope that helps.
>
>
>
> Regards,
>
> Robert
>
>
>
>
>
> On Fri, Nov 13, 2015 at 2:58 PM, LINZ, Arnaud 
> wrote:
>
> Hello,
>
>
>
> I use the brand new 0.10 version and I have problems running a streaming
> execution. My topology is linear : a custom source SC scans a directory and
> emits hdfs file names ; a first mapper M1 opens the file and emits its
> lines ; a filter F filters lines ; another mapper M2 transforms them ; and
> a mapper/sink M3->SK stores them in HDFS.
>
>
>
> SC->M1->F->M2->M3->SK
>
>
>
> The M2 transformer uses a bit of RAM because when it opens it loads a 11M
> row static table inside a hash map to enrich the lines. I use 55 slots on
> Yarn, using 11 containers of 12Gb x 5 slots
>
>
>
> To my understanding, I should not have any memory problem since each
> record is independent : no join, no key, no aggregation, no window => it’s
> a simple flow mapper, with RAM simply used as a buffer. However, if I
> submit enough input data, I systematically crash my app with “Connection
> unexpectedly closed by remote task manager” exception, and the first error
> in YARN log shows that “a container is running beyond physical memory
> limits”.
>
>
>
> If I increase the container size, I simply need to feed in more data to
> get the crash happen.
>
>
>
> Any idea?
>
>
>
> Greetings,
>
> Arnaud
>
>
>
> _
>
> Exceptions in Flink dashboard detail :
>
>
>
> Root Exception :
>
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
> Connection unexpectedly closed by remote task manager 'bt1shli6/
> 172.21.125.31:33186'. This might indicate that the remote task manager
> was lost.
>
>at
> org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:119)
>
> (…)
>
>

Re: Crash in a simple "mapper style" streaming app likely due to a memory leak ?

2015-11-13 Thread Ufuk Celebi

> On 13 Nov 2015, at 15:49, LINZ, Arnaud  wrote:
> 
> Hi Robert,
>  
> Thanks, it works with 50% -- at least way past the previous crash point.
>  
> In my opinion (I lack real metrics), the part that uses the most memory is 
> the M2 mapper, instantiated once per slot.
> The most complex part is the Sink (it does use a lot of hdfs files, flushing 
> threads etc.) ; but I expect the “RichSinkFunction” to be instantiated only 
> once per slot ? I’m really surprised by that memory usage, I will try using a 
> monitoring app on the yarn jvm to understand.

In general it’s instantiated once per subtask. For your current deployment, it 
is one per slot.

– Ufuk



Re: local debug Scala

2015-11-13 Thread rmetzger0
This issue occurs when Flink has been built against a different major Scala
release.
Did you build Flink yourself or download it from somewhere (Maven Central)?



--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/local-debug-Scala-tp1764p3481.html
Sent from the Apache Flink User Mailing List archive at Nabble.com.


Re: Apache Flink Forward Videos

2015-11-13 Thread Welly Tambunan
Thanks Max,

I see that all the videos are already there. The keynote has also
been uploaded.

Great stuff !!

Cheers

On Fri, Nov 13, 2015 at 5:12 PM, Maximilian Michels  wrote:

> Hi Welly,
>
> Thanks for sharing! The videos are coming. They soon will all be available.
>
> Cheers,
> Max
>
> On Fri, Nov 13, 2015 at 11:08 AM, Welly Tambunan 
> wrote:
> > Hi All,
> >
> > I've just noticed that the videos are already available for this one.
> >
> > http://flink-forward.org/?post_type=session
> >
> >
> > Another weekend gift for all.
> >
> > Cheers
> > --
> > Welly Tambunan
> > Triplelands
> >
> > http://weltam.wordpress.com
> > http://www.triplelands.com
>



-- 
Welly Tambunan
Triplelands

http://weltam.wordpress.com
http://www.triplelands.com 


Re: Flink, Kappa and Lambda

2015-11-13 Thread Welly Tambunan
Hi Christian,

Valid point. Thanks a lot for your great explanation.

In our case, we want to avoid lots of moving parts. We are working on a
greenfield project, so we want to adopt the latest approach while we have the
flexibility right now.
We also assessed Apache Spark before, and it didn't suit our real-time
purpose. But we have had a really great experience with Apache Flink so far.

So that's why we are trying to do everything as streaming, as that has HUGE
business value for us. And fortunately the technology already seems to be
there in the streaming world.


Cheers

On Fri, Nov 13, 2015 at 7:21 PM, Christian Kreutzfeldt 
wrote:

> Hi
>
> Personally, I find the the concepts of the so-called Kappa
> architecture intriguing. But I doubt that it is applicable in a generic
> setup where different use cases are mapped into the architecture. To be
> fair, I think the same applies to Lambda architectures. Therefore I
> wouldn't assume that Lambda architectures are obsolete with the advent of
> Kappa as new architectural paradigm.
>
> From my point of view, it all depends on the use case that you want to
> solve. For example, I saw a presentation given by Eric Tschetter and
> Fangjin Yang of MetaMarkets on how they use Hadoop and Druid to drive their
> business. They used Hadoop as long-term storage and Druid on the serving
> layer to provide up-to-date data into the business by updating it in
> sub-second intervals. Regularly they algin both systems to be consistent.
> In their case, the Lambda architecture serves their business quite well:
> speed achieved through the streaming layer and long time persistence
> through the batch layer.
>
> In cases where you - for example - want to create views on customer
> sessions by aggregating all events belonging to a single person and use
> them to
>
> * serve recommendation systems while the customer is still on your website
> and
> * keep them persistent in a long-term archive
>
> people tend to build typical Lambda architectures with duplicated
> sessionizing code on both layers. From my point of view this is unnecessary
> and introduces an additional source of errors. As customer sessions are
> created as stream of events, simply implement the logic on your streaming
> layer and persist the final session after a timeout in those systems where
> you need the data to be present: eg. recommender system receives constant
> updates on each new event and the batch layer (Hadoop) receives the
> finished session after it timed out.
>
> As Lambda - in most cases - is implemented to do the same thing on both
> layers, later merging the results to keep states consistent, the Kappa
> architecture introduces an interesting pattern that people often are not
> aware of. The idea to persist the stream itself and get rid of other
> systems, like RDBMS, NoSQL DBs or any other type of archive software, is
> often accepted as cheap way to reduce costs and maintenance efforts.
>
> But I think Kappa does more and may be expanded to other systems than
> streaming as well. You keep the data at that system persistent where it
> arrived or received a state you expect in subsequent systems. Why should I
> convert a stream of tracking events into a static schema and store the data
> inside an RDBMS? What if I rely on its nature that data is coming in as
> stream and do not want to have it exported/imported as bulk update but have
> the same stream replayed later? What about information loss? Being a stream
> of events is part of the information as well like the attributes each event
> carries.
>
> So, if Kappa is understood as architectural pattern where data is kept and
> processed the way it arrived or is expected by subsequent systems, I do not
> think that it will ever replace Lambda but it will complement it.
>
> Therefore I would like to give you the advice to look at your use case(s)
> and design the architecture as you need it. Do not stick with a certain
> pattern but deploy those parts that fit with your use-case. This context is
> far too young that it provides you with additional value strictly following
> a certain pattern, eg to make it more easier to integrate with third-party
> software.
>
> Best
>   Christian
>
>
> 2015-11-13 9:51 GMT+01:00 Welly Tambunan :
>
>> Hi rss rss,
>>
>> Yes. I have already read that book.
>>
>> However given the state of streaming right now, and Kappa Architecture, I
>> don't think we need Lambda Architecture again ?
>>
>> Any thoughts ?
>>
>> On Thu, Nov 12, 2015 at 12:29 PM, rss rss  wrote:
>>
>>> Hello,
>>>
>>>   regarding the Lambda architecture there is a following book -
>>> https://www.manning.com/books/big-data (Big Data. Principles and best
>>> practices of scalable realtime data systems
>>>  Nathan Marz and James Warren).
>>>
>>> Regards,
>>> Roman
>>>
>>> 2015-11-12 4:47 GMT+03:00 Welly Tambunan :
>>>
 Hi Stephan,


 Thanks for your response.


 We are trying to justify whether it's enough to use Kappa Architecture

Re: Multilang Support on Flink

2015-11-13 Thread Welly Tambunan
Hi Max,

Do you know where the repo is?

I tried searching in flink-staging, but it seems it's not there anymore (via
Google).

Cheers

On Fri, Nov 13, 2015 at 5:07 PM, Maximilian Michels  wrote:

> Hi Welly,
>
> There is a protocol for communicating with other processes. This is
> reflected in the flink-language-binding-generic module. I'm not aware of how
> the Spark or Storm communication protocols work, but this protocol is
> rather low-level.
>
> Cheers,
> Max
>
> On Fri, Nov 13, 2015 at 9:49 AM, Welly Tambunan  wrote:
> > Hi All,
> >
> > I want to ask if there's multilang support ( like in Storm and pipeTo in
> > Spark ) in flink ?
> >
> > I try to find it in the docs but can't find it.
> >
> > Any link or direction would be really appreciated.
> >
> >
> > Cheers
> >
> > --
> > Welly Tambunan
> > Triplelands
> >
> > http://weltam.wordpress.com
> > http://www.triplelands.com
>



-- 
Welly Tambunan
Triplelands

http://weltam.wordpress.com
http://www.triplelands.com 


Re: Apache Flink Operator State as Query Cache

2015-11-13 Thread Welly Tambunan
Hi Robert,

Does this version already handle stream imperfections, i.e. out-of-order
events?

Any resources on how this works, and the API reference?


Cheers

On Fri, Nov 13, 2015 at 4:00 PM, Welly Tambunan  wrote:

> Awesome !
>
> This is really the best weekend gift ever. :)
>
> Cheers
>
> On Fri, Nov 13, 2015 at 3:54 PM, Robert Metzger 
> wrote:
>
>> Hi Welly,
>> Flink 0.10.0 is out, its just not announced yet.
>> Its available on maven central and the global mirrors are currently
>> syncing it. This mirror for example has the update already:
>> http://apache.mirror.digionline.de/flink/flink-0.10.0/
>>
>> On Fri, Nov 13, 2015 at 9:50 AM, Welly Tambunan 
>> wrote:
>>
>>> Hi Aljoscha,
>>>
>>> Thanks for this one. Looking forward for 0.10 release version.
>>>
>>> Cheers
>>>
>>> On Thu, Nov 12, 2015 at 5:34 PM, Aljoscha Krettek 
>>> wrote:
>>>
 Hi,
 I don’t know yet when the operator state will be transitioned to
 managed memory but it could happen for 1.0 (which will come after 0.10).
 The good thing is that the interfaces won’t change, so state can be used as
 it is now.

 For 0.10, the release vote is winding down right now, so you can expect
 the release to happen today or tomorrow. I think the streaming is
 production-ready now; we expect mostly hardening and some
 infrastructure changes (for example annotations that specify API stability)
 for the 1.0 release.

 Let us know if you need more information.

 Cheers,
 Aljoscha
 > On 12 Nov 2015, at 02:42, Welly Tambunan  wrote:
 >
 > Hi Stephan,
 >
 > >Storing the model in OperatorState is a good idea, if you can. On
 the roadmap is to migrate the operator state to managed memory as well, so
 that should take care of the GC issues.
 > Is this using off the heap memory ? Which version we expect this one
 to be available ?
 >
 > Another question is when will the release version of 0.10 will be out
 ? We would love to upgrade to that one when it's available. That version
 will be a production ready streaming right ?
 >
 >
 >
 >
 >
 > On Wed, Nov 11, 2015 at 4:49 PM, Stephan Ewen 
 wrote:
 > Hi!
 >
 > In general, if you can keep state in Flink, you get better
 throughput/latency/consistency and have one less system to worry about
 (external k/v store). State outside means that the Flink processes can be
 slimmer and need fewer resources and as such recover a bit faster. There
 are use cases for that as well.
 >
 > Storing the model in OperatorState is a good idea, if you can. On the
 roadmap is to migrate the operator state to managed memory as well, so that
 should take care of the GC issues.
 >
 > We are just adding functionality to make the Key/Value operator state
 usable in CoMap/CoFlatMap as well (currently it only works in windows and
 in Map/FlatMap/Filter functions over the KeyedStream).
 > Until the, you should be able to use a simple Java HashMap and use
 the "Checkpointed" interface to get it persistent.
 >
 > Greetings,
 > Stephan
 >
 >
 > On Sun, Nov 8, 2015 at 10:11 AM, Welly Tambunan 
 wrote:
 > Thanks for the answer.
 >
 > Currently the approach that i'm using right now is creating a
 base/marker interface to stream different type of message to the same
 operator. Not sure about the performance hit about this compare to the
 CoFlatMap function.
 >
 > Basically this one is providing query cache, so i'm thinking instead
 of using in memory cache like redis, ignite etc, i can just use operator
 state for this one.
 >
 > I just want to gauge do i need to use memory cache or operator state
 would be just fine.
 >
 > However i'm concern about the Gen 2 Garbage Collection for caching
 our own state without using operator state. Is there any clarification on
 that one ?
 >
 >
 >
 > On Sat, Nov 7, 2015 at 12:38 AM, Anwar Rizal 
 wrote:
 >
 > Let me understand your case better here. You have a stream of model
 and stream of data. To process the data, you will need a way to access your
 model from the subsequent stream operations (map, filter, flatmap, ..).
 > I'm not sure in which case Operator State is a good choice, but I
 think you can also live without.
 >
 > val modelStream =  // get the model stream
 > val dataStream   =
 >
 > modelStream.broadcast.connect(dataStream). coFlatMap(  ) Then you can
 keep the latest model in a CoFlatMapRichFunction, not necessarily as
 Operator State, although maybe OperatorState is a good choice too.
 >
 > Does it make sense to you ?
 >
 > Anwar
 >
 > On Fri, Nov 6, 2015 at 10:21 AM, Welly Tambunan 
 wrote:
 > Hi All,
 >
 > We have high-density data that requires downsampling. However this
>>>