Re: TeraSort on Flink and Spark

2015-07-03 Thread Stephan Ewen
Flavio,

In general, String works well in Flink, because for sorting it behaves much
like this OptimizedText.

If you want to access the String contents, then using String is good. Text
may have slight advantages if you never access the actual contents, but
just partition and sort it (as in TeraSort).

The key length is limited to 10, because in TeraSort, the keys are defined
to be 10 characters long ;-)

Greetings,
Stephan


On Thu, Jul 2, 2015 at 9:14 PM, Flavio Pompermaier 
wrote:

> Hi Stephan,
> if I understood correctly, you are substituting the Text key with a more
> efficient version (OptimizedText).
> Just one question: you set the max length of the key to 10..do you know
> that a priori?
> Is this implementation of the key much more efficient than just using
> String?
> Is there any comparison about that?
>
> Best,
> Flavio
> On 2 Jul 2015 20:29, "Stephan Ewen"  wrote:
>
>> Hello Dongwon Kim!
>>
>> Thank you for sharing these numbers with us.
>>
>> I have gone through your implementation and there are two things you
>> could try:
>>
>> 1)
>>
>> I see that you sort Hadoop's Text data type with Flink. I think this may
>> be less efficient than sorting String, or a Flink-specific data type.
>>
>> For efficient byte operations on managed memory, Flink needs to
>> understand the binary representation of the data type. Flink understands
>> that for "String" and many other types, but not for "Text".
>>
>> There are two things you can do:
>>   - First, try what happens if you map the Hadoop Text type to a Java
>> String (only for the tera key).
>>   - Second, you can try what happens if you wrap the Hadoop Text type in
>> a Flink type that supports optimized binary sorting. I have pasted code for
>> that at the bottom of this email.
>>
>> 2)
>>
>> You can check whether enabling object reuse in Flink helps performance.
>> You can do this on the ExecutionEnvironment via
>> "env.getConfig().enableObjectReuse()". Flink then tries to reuse the same
>> objects repeatedly, in case they are mutable.
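>>
>> For example (a minimal sketch):
>>
>> ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>> // let Flink reuse mutable objects across function calls instead of
>> // allocating a new object for every record
>> env.getConfig().enableObjectReuse();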
>>
>>
>> Can you try these options out and see how they affect Flink's runtime?
>>
>>
>> Greetings,
>> Stephan
>>
>> -
>> Code for optimized sortable (Java):
>>
>> import java.io.IOException;
>>
>> import org.apache.flink.core.memory.DataInputView;
>> import org.apache.flink.core.memory.DataOutputView;
>> import org.apache.flink.core.memory.MemorySegment;
>> import org.apache.flink.types.NormalizableKey;
>> import org.apache.hadoop.io.Text;
>>
>> public final class OptimizedText implements NormalizableKey<OptimizedText> {
>>
>>     private final Text text;
>>
>>     public OptimizedText() {
>>         this.text = new Text();
>>     }
>>
>>     public OptimizedText(Text from) {
>>         this.text = from;
>>     }
>>
>>     public Text getText() {
>>         return text;
>>     }
>>
>>     @Override
>>     public int getMaxNormalizedKeyLen() {
>>         // TeraSort keys are defined to be 10 characters long
>>         return 10;
>>     }
>>
>>     @Override
>>     public void copyNormalizedKey(MemorySegment memory, int offset, int len) {
>>         // copy at most the 10 key bytes into the normalized-key buffer,
>>         // so comparisons can run directly on bytes in managed memory
>>         memory.put(offset, text.getBytes(), 0,
>>                 Math.min(text.getLength(), Math.min(10, len)));
>>     }
>>
>>     @Override
>>     public void write(DataOutputView out) throws IOException {
>>         text.write(out);
>>     }
>>
>>     @Override
>>     public void read(DataInputView in) throws IOException {
>>         text.readFields(in);
>>     }
>>
>>     @Override
>>     public int compareTo(OptimizedText o) {
>>         return this.text.compareTo(o.text);
>>     }
>> }
>>
>> -
>> Converting Text to OptimizedText (Java code)
>>
>> map(new MapFunction<Tuple2<Text, Text>, Tuple2<OptimizedText, Text>>() {
>>
>>     @Override
>>     public Tuple2<OptimizedText, Text> map(Tuple2<Text, Text> value) {
>>         // wrap only the key (f0) in OptimizedText; the value (f1) passes through
>>         return new Tuple2<OptimizedText, Text>(new OptimizedText(value.f0), value.f1);
>>     }
>> })
>>
>>
>>
>>
>> On Thu, Jul 2, 2015 at 6:47 PM, Dongwon Kim 
>> wrote:
>>
>>> Hello,
>>>
>>> I'd like to share my code for TeraSort on Flink and Spark which uses
>>> the same range partitioner as Hadoop TeraSort:
>>> https://github.com/eastcirclek/terasort
>>>
>>> I also wrote a short report on it:
>>>
>>> http://eastcirclek.blogspot.kr/2015/06/terasort-for-spark-and-flink-with-range.html
>>> In the blog post, I make a simple performance comparison between
>>> Flink, Spark, Tez, and MapReduce.
>>>
>>> I hope it will be helpful to you guys!
>>> Thanks.
>>>
>>> Dongwon Kim
>>> Postdoctoral Researcher @ Postech
>>>
>>
>>


Re: writeAsCsv not writing anything on HDFS when WriteMode set to OVERWRITE

2015-07-03 Thread Maximilian Michels
You're welcome. I'm glad I could help out :)

Cheers,
Max

On Thu, Jul 2, 2015 at 9:17 PM, Mihail Vieru 
wrote:

>  I've implemented the alternating-two-files solution and everything works
> now.
>
> Thanks a lot! You saved my day :)
>
> Cheers,
> Mihail
>
>
> On 02.07.2015 12:37, Maximilian Michels wrote:
>
>   The problem is that your input and output paths are the same. Because
> Flink executes in a pipelined fashion, all the operators come up at
> once. When you set WriteMode.OVERWRITE for the sink, it will delete the
> path before writing anything. That means that when your DataSource reads
> the input, there will be nothing left to read. Thus you get an empty
> DataSet, which you write to HDFS again. Any further iterations will then
> just write nothing.
>
>  You can circumvent this problem by prefixing every output file with a
> counter that you increment in your loop. Alternatively, if you only want to
> keep the latest output, you can use two files and let them alternate as
> input and output.
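>
>  For illustration, the alternating-files idea could look roughly like this
> (a sketch: env, maxIterations, the paths, the Tuple2<Long, Long> element
> type, and the step() transformation are all placeholders):
>
> String pathA = "hdfs://path/to/tempgraph-a";
> String pathB = "hdfs://path/to/tempgraph-b";
> for (int i = 0; i < maxIterations; i++) {
>     // the two files swap roles on every iteration
>     String inPath = (i % 2 == 0) ? pathA : pathB;
>     String outPath = (i % 2 == 0) ? pathB : pathA;
>     DataSet<Tuple2<Long, Long>> vertices =
>         env.readCsvFile(inPath).types(Long.class, Long.class);
>     step(vertices).writeAsCsv(outPath, "\n", ";", WriteMode.OVERWRITE);
>     env.execute("Iteration " + i);
> }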
>
>  Let me know if you have any further questions.
>
>  Kind regards,
>  Max
>
> On Thu, Jul 2, 2015 at 10:20 AM, Maximilian Michels 
> wrote:
>
>> Hi Mihail,
>>
>> Thanks for the code. I'm trying to reproduce the problem now.
>>
>> On Wed, Jul 1, 2015 at 8:30 PM, Mihail Vieru <
>> vi...@informatik.hu-berlin.de> wrote:
>>
>>>  Hi Max,
>>>
>>> thank you for your reply. I wanted to revise and dismiss all other
>>> factors before writing back. I've attached you my code and sample input
>>> data.
>>>
>>> I run the APSPNaiveJob using the following arguments:
>>>
>>> 0 100 hdfs://path/to/vertices-test-100 hdfs://path/to/edges-test-100
>>> hdfs://path/to/tempgraph 10 0.5 hdfs://path/to/output-apsp 9
>>>
>>> I was wrong: I originally thought that the first writeAsCsv call (line
>>> 50) doesn't work. Without WriteMode.OVERWRITE, an exception is thrown
>>> when the file exists.
>>>
>>> But the problem lies with the second call (line 74), trying to write to
>>> the same path on HDFS.
>>>
>>> This issue is blocking me, because I need to persist the vertices
>>> dataset between iterations.
>>>
>>> Cheers,
>>> Mihail
>>>
>>> P.S.: I'm using the latest 0.10-SNAPSHOT and HDFS 1.2.1.
>>>
>>>
>>>
>>> On 30.06.2015 16:51, Maximilian Michels wrote:
>>>
>>>   Hi Mihail,
>>>
>>>  Thank you for your question. Do you have a short example that
>>> reproduces the problem? It is hard to find the cause without an error
>>> message or some example code.
>>>
>>>  I wonder how your loop works without WriteMode.OVERWRITE because it
>>> should throw an exception in this case. Or do you change the file names on
>>> every write?
>>>
>>>  Cheers,
>>>  Max
>>>
>>> On Tue, Jun 30, 2015 at 3:47 PM, Mihail Vieru <
>>> vi...@informatik.hu-berlin.de> wrote:
>>>
  I think my problem is related to a loop in my job.

 Before the loop, the writeAsCsv method works fine, even in overwrite
 mode.

 In the loop, in the first iteration, it writes an empty folder
 containing empty files to HDFS, even though the DataSet it is supposed
 to write contains elements.

 Needless to say, this doesn't occur in a local execution environment
 when writing to the local file system.


 I would appreciate any input on this.

 Best,
 Mihail



 On 30.06.2015 12:10, Mihail Vieru wrote:

 Hi Till,

 thank you for your reply.

 I have the following code snippet:

 intermediateGraph.getVertices().writeAsCsv(tempGraphOutputPath, "\n",
 ";", WriteMode.OVERWRITE);

 When I remove the WriteMode parameter, it works. So I can reason that
 the DataSet contains data elements.

 Cheers,
 Mihail


 On 30.06.2015 12:06, Till Rohrmann wrote:

  Hi Mihail,

 have you checked that the DataSet you want to write to HDFS actually
 contains data elements? You can try calling collect(), which retrieves
 the data to your client, to see what’s in there.
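
 For example (a quick sketch; dataSet and the Tuple2<Long, Long> element
 type are placeholders):

 List<Tuple2<Long, Long>> elements = dataSet.collect();
 System.out.println("DataSet contains " + elements.size() + " elements");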

 Cheers,
 Till

 On Tue, Jun 30, 2015 at 12:01 PM, Mihail Vieru <
 vi...@informatik.hu-berlin.de> wrote:

> Hi,
>
> the writeAsCsv method is not writing anything to HDFS (version 1.2.1)
> when the WriteMode is set to OVERWRITE.
> A file is created, but it's empty, and there is no trace of errors in the
> Flink or Hadoop logs on any node in the cluster.
>
> What could cause this issue? I really, really need this feature.
>
> Best,
> Mihail
>




>>>
>>>
>>
>
>


Open method is not called with custom implementation RichWindowMapFunction

2015-07-03 Thread tambunanw
Hi All, 

I'm experimenting with rich windowing functions and operator state. I
modified the streaming stock prices example from

https://github.com/mbalassi/flink/blob/stockprices/flink-staging/flink-streaming/flink-streaming-examples/src/main/scala/org/apache/flink/streaming/scala/examples/windowing/StockPrices.scala

I created a simple windowing function, shown below:

class MyWindowFunction extends RichWindowMapFunction[StockPricex, StockPricex] {
  println("created")
  private var counter = 0

  override def open(conf: Configuration): Unit = {
    println("opened")
  }

  override def mapWindow(values: Iterable[StockPricex], out: Collector[StockPricex]): Unit = {
    // if not initialized ..
    println(counter)
    println(values)
    counter = counter + 1
  }
}

However, the open() method is not invoked when I try to run this code in my
local environment:

spx
  .groupBy(x => x.symbol)
  .window(Time.of(5, TimeUnit.SECONDS)).every(Time.of(1, TimeUnit.SECONDS))
  .mapWindow(new MyWindowFunction)

Any thoughts on this one?


Cheers





Re: Open method is not called with custom implementation RichWindowMapFunction

2015-07-03 Thread Chiwan Park
Hi tambunanw,

This issue is already known and we’ll patch it soon. [1]
The problem will be solved in the next release (maybe 0.9.1).

Regards,
Chiwan Park

[1] https://issues.apache.org/jira/browse/FLINK-2257


Re: Open method is not called with custom implementation RichWindowMapFunction

2015-07-03 Thread Welly Tambunan
Thanks Chiwan,


Glad to hear that.


Cheers


-- 
Welly Tambunan
Triplelands

http://weltam.wordpress.com
http://www.triplelands.com 


Flink Streaming : PartitionBy vs GroupBy differences

2015-07-03 Thread tambunanw
Hi All, 

I'm trying to digest the difference between these two. From my experience
in Spark, groupBy causes shuffling over the network. Is that also the
case in Flink?

I've watched videos and read a couple of docs about Flink; Flink will
compile the user code into its own optimized graph structure, so I think
the Flink engine will take care of this?

From the docs for Partitioning:

http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#partitioning

Is it true that groupBy is more advanced than partitionBy? Can someone
elaborate?

This is really confusing for me, coming from the Spark world. Any help
would be really appreciated.

Cheers 







Re: Open method is not called with custom implementation RichWindowMapFunction

2015-07-03 Thread Chiwan Park
I found that the patch has been merged upstream. [1] :)

Regards,
Chiwan Park

[1] https://github.com/apache/flink/pull/855



Re: Open method is not called with custom implementation RichWindowMapFunction

2015-07-03 Thread Welly Tambunan
Thanks Chiwan

Great job!

Cheers


-- 
Welly Tambunan
Triplelands

http://weltam.wordpress.com
http://www.triplelands.com 


Re: Flink Streaming : PartitionBy vs GroupBy differences

2015-07-03 Thread Gyula Fóra
Hey!

Both groupBy and partitionBy will trigger a shuffle over the network based
on some key, ensuring that elements with the same key end up at the same
downstream processing operator.

The difference between the two is that groupBy, in addition to this, returns
a GroupedDataStream, which lets you execute some special operations, such as
key-based rolling aggregates.

PartitionBy is useful when you are using simple operators but still want to
control which messages the parallel instances receive (in a mapper, for
example).
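
For example, sketched with the API names used in this thread (the
StockPrice type, StockSource, and MyMapper are placeholders, and the exact
method variants may differ in your Flink version):

DataStream<StockPrice> prices = env.addSource(new StockSource());
// groupBy returns a GroupedDataStream and enables keyed operations,
// e.g. a rolling maximum per symbol:
prices.groupBy("symbol").maxBy("price");
// partitionBy only controls routing: records with the same symbol reach
// the same parallel mapper instance, with no extra keyed operations:
prices.partitionBy("symbol").map(new MyMapper());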

Cheers,
Gyula



In Windows 8 + VirtualBox, how to build a Flink development environment?

2015-07-03 Thread Chenliang (Liang, DataSight)
Dear

In Windows 8 + VirtualBox, how do I build a Flink development environment?


Re: In Windows 8 + VirtualBox, how to build a Flink development environment?

2015-07-03 Thread Stephan Ewen
Hi!

With Windows 8 + VirtualBox, I would run a Linux VM. I run Ubuntu in
VirtualBox on Windows 7 myself.

In the Linux environment, make sure you have git, Java 7+ and Maven
installed; see also here:
https://github.com/apache/flink/blob/master/README.md#building-apache-flink-from-source
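
Roughly, the usual steps from the README are:

git clone https://github.com/apache/flink.git
cd flink
mvn clean package -DskipTests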

For development, I recommend IntelliJ:
https://github.com/apache/flink/blob/master/docs/internals/ide_setup.md#intellij-idea

Greetings,
Stephan




Re: In Windows 8 + VirtualBox, how to build a Flink development environment?

2015-07-03 Thread Stephan Ewen
Let us know if you run into setup problems...



Re: Flink Streaming : PartitionBy vs GroupBy differences

2015-07-03 Thread Welly Tambunan
Hi Gyula,

Thanks for your response.

So if I use partitionBy, will data points with the same key be received by
exactly the same operator instance?


Another question: if I execute a reduce() operator after partitionBy, will
that reduce operator guarantee ordering within the same key?


Cheers



-- 
Welly Tambunan
Triplelands

http://weltam.wordpress.com
http://www.triplelands.com 


Re: Flink Streaming : PartitionBy vs GroupBy differences

2015-07-03 Thread Gyula Fóra
Hey,

1.
Yes, if you use partitionBy the same key will always go to the same
downstream operator instance.

2.
There is only a partial ordering guarantee, meaning that data received from
one input is FIFO. This means that if the same key is coming from multiple
inputs, then there is no ordering guarantee across inputs, only within one
input.

Gyula



Re: Flink Streaming : PartitionBy vs GroupBy differences

2015-07-03 Thread Welly Tambunan
Hi Gyula,

Thanks a lot. That's enough for my case.

I really love the Flink Streaming model compared to Spark Streaming.

So is it true that I can think of an operator as an actor in this
system? Is that the right way to put it?



Cheers



-- 
Welly Tambunan
Triplelands

http://weltam.wordpress.com
http://www.triplelands.com 


Re: Flink Streaming : PartitionBy vs GroupBy differences

2015-07-03 Thread Gyula Fóra
Yes, you can think of it that way. Each operator has parallel instances, and
each parallel instance receives input from multiple channels (FIFO from
each) and produces output.



Re: Flink Streaming : PartitionBy vs GroupBy differences

2015-07-03 Thread Welly Tambunan
Thanks Gyula


Cheers



-- 
Welly Tambunan
Triplelands

http://weltam.wordpress.com
http://www.triplelands.com