Re: Hardware requirements and learning resources

2015-09-03 Thread Stefan Winterstein

> Answering my own question, I have found some nice training material at
> http://dataartisans.github.io/flink-training. 

Excellent resources! Somehow, I managed not to stumble over them by
myself - either I was blind, or they are well hidden... :)


Best,
-Stefan



Re: Hardware requirements and learning resources

2015-09-02 Thread Juan Rodríguez Hortalá
Hi Robert and Jay,

Thanks for your answers. The petstore jobs could indeed serve as a kind of
Rosetta Code for Flink and Spark.

Regarding the memory requirements, that is very good news to me: 2 GB of RAM
is certainly a modest amount of memory, and even some single-board computers
offer that much. Are there any reference load-test programs or benchmarks
that can be used to compare different Flink deployments? Maybe the petstore
implementation mentioned by Jay could be used for that, and also to compare
the performance of Flink with other systems like Spark or Hadoop MapReduce,
which I understand is the current goal.

Greetings,

Juan


2015-09-02 14:56 GMT+02:00 jay vyas :

> We're also working on a BigPetStore implementation for Flink, which will
> help onboard Spark/MapReduce folks.
>
> I have prototype code here that runs a simple job in memory (contributions
> welcome):
> https://github.com/bigpetstore/bigpetstore-flink
>
> Right now it fails with a serialization error.

Re: Hardware requirements and learning resources

2015-09-02 Thread Juan Rodríguez Hortalá
Answering my own question, I have found some nice training material at
http://dataartisans.github.io/flink-training. There are even videos on
YouTube for some of the slides:

  - http://dataartisans.github.io/flink-training/overview/intro.html
    https://www.youtube.com/watch?v=XgC6c4Wiqvs

  - http://dataartisans.github.io/flink-training/dataSetBasics/intro.html
    https://www.youtube.com/watch?v=0EARqW15dDk

The third lecture,
http://dataartisans.github.io/flink-training/dataSetAdvanced/intro.html,
corresponds more or less, though not exactly, to
https://www.youtube.com/watch?v=1yWKZ26NQeU. There are more lessons at
http://dataartisans.github.io/flink-training on stream processing and the
Table API, for which I haven't found videos. Does anyone have pointers to
the missing videos?

Greetings,

Juan

2015-09-02 12:50 GMT+02:00 Juan Rodríguez Hortalá <
juan.rodriguez.hort...@gmail.com>:

> Hi list,
>
> I'm new to Flink, and I find this project very interesting. I have
> experience with Apache Spark, and from what I've seen so far, Flink
> provides an API at a similar abstraction level but based on single-record
> processing instead of batch processing. I've read on Quora that Flink
> extends stream processing to batch processing, while Spark extends batch
> processing to streaming. Therefore I find Flink especially attractive for
> low-latency stream processing. Anyway, I would appreciate it if someone
> could point me to a list of hardware requirements for the slave nodes in a
> Flink cluster, something along the lines of
> https://spark.apache.org/docs/latest/hardware-provisioning.html.
> Spark is known for having quite high minimum memory requirements (8 GB RAM
> and 8 cores minimum), and I was wondering whether that is also the case for
> Flink. Lower memory requirements would be very interesting for building
> small Flink clusters for educational purposes or for small projects.
>
> Apart from that, I wonder if there is some blog post by the community about
> transitioning from Spark to Flink. I think it could be interesting, as
> there are some similarities in the APIs, but also deep differences in the
> underlying approaches. I was thinking of something like Breeze's cheatsheet
> comparing its matrix operations with those available in Matlab and NumPy,
> https://github.com/scalanlp/breeze/wiki/Linear-Algebra-Cheat-Sheet, or
> like http://rosettacode.org/wiki/Factorial. Just an idea anyway. Also,
> any pointer to some online course, book or training for Flink besides the
> official programming guides would be much appreciated.
>
> Thanks in advance for your help.
>
> Greetings,
>
> Juan
>
>


Re: Hardware requirements and learning resources

2015-09-02 Thread Robert Metzger
Hi Juan,

I think the recommendations in the Spark guide are quite good, and are
similar to what I would recommend for Flink as well.
Depending on the workloads you are interested in running, you can certainly
use Flink with less than 8 GB per machine. I think you can start Flink
TaskManagers with 500 MB of heap space and they'll still be able to process
a few GB of data.

Everything above 2 GB is probably good enough for some initial
experimentation (again depending on your workloads, network, disk speed,
etc.).




On Wed, Sep 2, 2015 at 2:30 PM, Kostas Tzoumas  wrote:

> Hi Juan,
>
> Flink is quite nimble with hardware requirements; people have run it on
> old-ish laptops and also on the largest instances available from cloud
> providers. I will let others chime in with more details.
>
> I am not aware of anything along the lines of the cheatsheet that you
> mention. If you actually try to do this, I would love to see it, and it
> might be useful to others as well. Both systems use similar abstractions at
> the API level (i.e., parallel collections), so if you stay true to the
> functional paradigm and do not try to "abuse" the system by exploiting
> knowledge of its internals, things should be straightforward. This applies
> to the batch APIs; the streaming API in Flink follows a true streaming
> paradigm, where you get an unbounded stream of records and operators on
> these streams.
>
> Funny that you ask about a video for the DataStream slides. There is a
> Flink training happening as we speak, and a video is being recorded right
> now :-) Hopefully it will be made available soon.
>
> Best,
> Kostas
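
As a small illustration of the distinction Kostas draws above between the
batch APIs and the streaming API, here is a minimal sketch in plain Python.
It does not use the Flink or Spark APIs at all; the function names
(batch_word_count, streaming_word_count, endless_source) and the sample
records are made up for this example. The batch version sees a complete,
finite collection and returns one final result, while the streaming version
handles one record at a time from a possibly unbounded source and emits an
updated count after every word.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, Tuple


def batch_word_count(lines: Iterable[str]) -> Counter:
    """Batch style: the whole (finite) input is available; return one final result."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def streaming_word_count(records: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Streaming style: process one record at a time and emit an update per word."""
    counts = Counter()
    for record in records:  # never terminates if the source is unbounded
        for word in record.lower().split():
            counts[word] += 1
            yield word, counts[word]


def endless_source() -> Iterator[str]:
    """Hypothetical unbounded source that repeats two records forever."""
    while True:
        yield "to be"
        yield "or not to be"


if __name__ == "__main__":
    # Batch: one result once all input has been read.
    print(batch_word_count(["to be", "or not to be"]))
    # Streaming: take only the first six updates of the unbounded stream.
    print(list(islice(streaming_word_count(endless_source()), 6)))
```

The streaming function is a generator, so the caller decides how many
updates to consume; there is no point at which a "final" result exists,
which is the main conceptual difference from the batch version.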


Re: Hardware requirements and learning resources

2015-09-02 Thread Jay Vyas
Just running the main class is sufficient

> On Sep 2, 2015, at 8:59 AM, Robert Metzger  wrote:
> 
> Hey jay,
> 
> How can I reproduce the error?

Re: Hardware requirements and learning resources

2015-09-02 Thread Robert Metzger
Hey jay,

How can I reproduce the error?

On Wed, Sep 2, 2015 at 2:56 PM, jay vyas 
wrote:

> We're also working on a BigPetStore implementation for Flink, which will
> help onboard Spark/MapReduce folks.
>
> I have prototype code here that runs a simple job in memory (contributions
> welcome):
> https://github.com/bigpetstore/bigpetstore-flink
>
> Right now it fails with a serialization error.

Re: Hardware requirements and learning resources

2015-09-02 Thread jay vyas
We're also working on a BigPetStore implementation for Flink, which will
help onboard Spark/MapReduce folks.

I have prototype code here that runs a simple job in memory (contributions
welcome):
https://github.com/bigpetstore/bigpetstore-flink

Right now it fails with a serialization error.

On Wed, Sep 2, 2015 at 8:50 AM, Robert Metzger  wrote:

> Hi Juan,
>
> I think the recommendations in the Spark guide are quite good, and are
> similar to what I would recommend for Flink as well.
> Depending on the workloads you are interested in running, you can certainly
> use Flink with less than 8 GB per machine. I think you can start Flink
> TaskManagers with 500 MB of heap space and they'll still be able to process
> a few GB of data.
>
> Everything above 2 GB is probably good enough for some initial
> experimentation (again depending on your workloads, network, disk speed,
> etc.).

Re: Hardware requirements and learning resources

2015-09-02 Thread Robert Metzger
@Jay: I've looked into your code, but I was not able to reproduce the
issue.
I'll start a new discussion thread on the user@flink list for the
Flink-BigPetStore discussion. I don't want to take over Juan's
hardware-requirements discussion ;)

On Wed, Sep 2, 2015 at 3:01 PM, Jay Vyas 
wrote:

> Just running the main class is sufficient