Re: Bigpetstore - Flink integration

2015-09-02 Thread Robert Metzger
Okay, I see.

As I said before, I was not able to reproduce the serialization issue
you've reported.
Can you maybe post the exception you are seeing?

On Wed, Sep 2, 2015 at 3:32 PM, jay vyas 
wrote:

> Hey, thanks!
>
> Those are just seeds, the files aren't large.
>
> The scale-out data is the transactions.
>
> The seed data needs to be the same, shipped to ALL nodes, and then the
> nodes generate transactions.
>
>
> On Wed, Sep 2, 2015 at 9:21 AM, Robert Metzger 
> wrote:
>
>> I'm starting a new discussion thread for the bigpetstore-flink
>> integration ...
>>
>>
>> I took a closer look at the code you've posted.
>> It seems to me that you are generating a lot of data locally on the
>> client before you actually submit a job to Flink (both "customers" and
>> "stores" are generated locally).
>> Is that only some "seed" data?
>>
>> I would actually try to generate as much data as possible in the cluster,
>> making the generator very scalable.
>>
>> I don't think that you need to register a Kryo serializer for the Product
>> and Transaction type.
>> I was able to run the code without the serializer registration.
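For reference, this is the kind of registration being discussed: a short
fragment against the DataSet API's ExecutionConfig. Product and Transaction
stand in for the BigPetStore types, and MyKryoSerializer is a hypothetical
custom serializer, so treat this as a sketch rather than the project's code.

    import org.apache.flink.api.java.ExecutionEnvironment;

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Plain Kryo registration -- usually unnecessary, since Flink falls
    // back to Kryo for generic types on its own:
    env.getConfig().registerKryoType(Product.class);
    env.getConfig().registerKryoType(Transaction.class);

    // Only if a type needs special handling would a custom serializer be
    // registered (MyKryoSerializer is hypothetical):
    env.getConfig().registerTypeWithKryoSerializer(
        Transaction.class, MyKryoSerializer.class);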
>>
>>
>> -- Forwarded message --
>> From: jay vyas 
>> Date: Wed, Sep 2, 2015 at 2:56 PM
>> Subject: Re: Hardware requirements and learning resources
>> To: user@flink.apache.org
>>
>>
>> We're also working on a BigPetStore implementation for Flink, which will
>> help onboard Spark/MapReduce folks.
>>
>> I have prototype code here that runs a simple job in memory
>> (contributions welcome); right now there is a serialization error:
>> https://github.com/bigpetstore/bigpetstore-flink .
>>
>> On Wed, Sep 2, 2015 at 8:50 AM, Robert Metzger 
>> wrote:
>>
>>> Hi Juan,
>>>
>>> I think the recommendations in the Spark guide are quite good, and they
>>> are similar to what I would recommend for Flink as well.
>>> Depending on the workloads you are interested in running, you can
>>> certainly use Flink with less than 8 GB per machine. I think you can
>>> start Flink TaskManagers with 500 MB of heap space and they'll still be
>>> able to process a few GB of data.
>>>
>>> Anything above 2 GB is probably good enough for some initial
>>> experimentation (again, depending on your workloads, network, disk
>>> speed, etc.).
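For a concrete starting point, a minimal flink-conf.yaml along those lines.
The key names match the Flink documentation of that era; treat this as a
sketch for experimentation, not a tuned recommendation.

    # flink-conf.yaml -- small-memory setup for initial experiments
    jobmanager.heap.mb: 256
    taskmanager.heap.mb: 500
    taskmanager.numberOfTaskSlots: 1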
>>>
>>>
>>>
>>>
>>> On Wed, Sep 2, 2015 at 2:30 PM, Kostas Tzoumas 
>>> wrote:
>>>
 Hi Juan,

 Flink is quite nimble with hardware requirements; people have run it on
 old-ish laptops and also on the largest instances available from cloud
 providers. I will let others chime in with more details.

 I am not aware of something along the lines of the cheatsheet you mention.
 If you actually try to do this, I would love to see it, and it might be
 useful to others as well. Both systems use similar abstractions at the API
 level (i.e., parallel collections), so if you stay true to the functional
 paradigm and don't try to "abuse" the system by exploiting knowledge of
 its internals, things should be straightforward. This applies to the batch
 APIs; the streaming API in Flink follows a true streaming paradigm, where
 you get an unbounded stream of records and operators on these streams.

 Funny that you ask about a video for the DataStream slides. There is a
 Flink training happening as we speak, and a video is being recorded right
 now :-) Hopefully it will be made available soon.

 Best,
 Kostas


 On Wed, Sep 2, 2015 at 1:13 PM, Juan Rodríguez Hortalá <
 juan.rodriguez.hort...@gmail.com> wrote:

> Answering my own question: I have found some nice training material at
> http://dataartisans.github.io/flink-training. There are even YouTube
> videos for some of the slides:
>
>   - http://dataartisans.github.io/flink-training/overview/intro.html
>     https://www.youtube.com/watch?v=XgC6c4Wiqvs
>
>   - http://dataartisans.github.io/flink-training/dataSetBasics/intro.html
>     https://www.youtube.com/watch?v=0EARqW15dDk
>
> The third lecture,
> http://dataartisans.github.io/flink-training/dataSetAdvanced/intro.html,
> more or less corresponds to https://www.youtube.com/watch?v=1yWKZ26NQeU,
> but not exactly, and there are more lessons at
> http://dataartisans.github.io/flink-training (for stream processing and
> the Table API) for which I haven't found videos. Does anyone have
> pointers to the missing videos?
>
> Greetings,
>
> Juan
>
> 2015-09-02 12:50 GMT+02:00 Juan Rodríguez Hortalá <
> juan.rodriguez.hort...@gmail.com>:
>
>> Hi list,
>>
>> I'm new to Flink, and I find this project very interesting. I have
>> experience with Apache Spark, and from what I've seen so far I find that
>> Flink provides an API at a similar abstraction level, but based on
>> single-record processing instead of batch processing. I've read on Quora
>> that Flink extends stream processing to batch processing, while Spark
>> extends batch processing to streaming. Therefore I find Flink especially
>> attractive for low-latency stream processing. Anyway, I would appreciate
>> it if someone could give some indication about where I could find a list
>> of hardware requirements for the slave nodes in a Flink cluster.

Re: Bigpetstore - Flink integration

2015-09-02 Thread jay vyas
Hey, thanks!

Those are just seeds, the files aren't large.

The scale-out data is the transactions.

The seed data needs to be the same, shipped to ALL nodes, and then the
nodes generate transactions.
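As a sketch of that pattern in Flink's DataSet API (illustrative names, not
the actual BigPetStore code): generate the small seed data once, broadcast
it to every node, and let each parallel task generate transactions from the
same seeds.

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    import java.util.List;

    public class SeedBroadcastSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Small seed data (stand-in for customers/stores), generated once.
            DataSet<String> seeds = env.fromElements("store-1", "store-2");

            // One element per parallel generator task.
            DataSet<Long> tasks = env.generateSequence(1, 100);

            DataSet<String> transactions = tasks
                .flatMap(new RichFlatMapFunction<Long, String>() {
                    private List<String> seedList;

                    @Override
                    public void open(Configuration parameters) {
                        // Every parallel instance sees the same broadcast seeds.
                        seedList = getRuntimeContext().getBroadcastVariable("seeds");
                    }

                    @Override
                    public void flatMap(Long taskId, Collector<String> out) {
                        for (String seed : seedList) {
                            out.collect("txn-" + taskId + "-" + seed);
                        }
                    }
                })
                .withBroadcastSet(seeds, "seeds");

            transactions.print();
        }
    }

Only the tiny seed set travels to each node; the transactions themselves
are produced in the cluster and scale with the parallelism.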


On Wed, Sep 2, 2015 at 9:21 AM, Robert Metzger  wrote:

> [...]

Re: Bigpetstore - Flink integration

2015-09-02 Thread Stephan Ewen
If a lot of the data is generated locally, this may face the same issue as
Greg did with oversized payloads (dropped by Akka).
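If that is what's happening, the knob involved is Flink's Akka frame size.
A hedged sketch of the config change, assuming the payload really does
exceed the default limit of roughly 10 MB; the cleaner fix is still to
generate the data in the cluster, as Robert suggests:

    # flink-conf.yaml -- raise the maximum Akka message size
    akka.framesize: 104857600b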

On Wed, Sep 2, 2015 at 3:21 PM, Robert Metzger  wrote:

> [...]

Re: Bigpetstore - Flink integration

2015-09-02 Thread jay vyas
Hmmm, interesting... it looks to be working magically now :) I must have
written some code late at night that fixed it and forgot. The original
errors I was getting were Kryo-related.

The objects aren't being serialized on write to anything useful yet, but
I'm sure that's an easy fix.

Onward and upward!
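On the write path, a minimal sketch of emitting generated records in a
useful form with the DataSet API (the path and the Tuple2 schema are
illustrative, not the BigPetStore code):

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class WriteSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            DataSet<Tuple2<String, Double>> transactions = env.fromElements(
                Tuple2.of("dog-food", 10.50),
                Tuple2.of("cat-toy", 7.95));

            // CSV keeps the fields machine-readable instead of relying on
            // the default toString() dump.
            transactions.writeAsCsv("file:///tmp/bps/transactions.csv", "\n", ",");

            // Sinks are lazy in the DataSet API; execute() triggers the write.
            env.execute("bigpetstore write sketch");
        }
    }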

On Wed, Sep 2, 2015 at 9:33 AM, Robert Metzger  wrote:

> Okay, I see.
>
> As I said before, I was not able to reproduce the serialization issue
> you've reported.
> Can you maybe post the exception you are seeing?
>
> [...]