XGBoost4J: Portable Distributed XGBoost in Flink

2016-03-14 Thread Tianqi Chen
Hi Flink Community:
I am sending this email to let you know we just released XGBoost4J, which
also runs on Flink. In short, XGBoost is a machine learning package that is
used by more than half of the machine learning challenge winning solutions and is
already widely used in industry. The distributed version scales to billions of
examples (10x faster than spark.mllib in the experiment) with fewer
resources (see http://arxiv.org/abs/1603.02754).

   See our blog post for more details:
http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html
We would love to have you try it out and help us make it better.

Cheers


Re: asm IllegalArgumentException with 1.0.0

2016-03-14 Thread Zach Cox
Yes, pretty much - we use sbt to run the job in a local environment, not
IntelliJ, but it should be the same thing. We were also seeing that exception
when running unit tests locally. We did not see the exception when assembling a
fat jar and submitting it to a remote Flink cluster.

It seems like the flink-connector-elasticsearch jar should not have shaded
classes in it. Maybe that jar in maven central was built incorrectly?

We worked around this by just not depending on that elasticsearch connector
at all, since we wrote our own connector for Elasticsearch 2.x.
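
[Editor's sketch: one way this workaround could look in an sbt build. The artifact names and versions are assumptions, not taken from the thread; the point is that the shaded asm classes sit inside the connector jar itself, so dropping the dependency, rather than adding transitive excludes, is what removes them from the local classpath.]

// build.sbt (sketch): keep the Flink dependencies but omit flink-connector-elasticsearch,
// and ship a hand-written Elasticsearch 2.x sink instead.
libraryDependencies ++= Seq(
  "org.apache.flink" %% "flink-streaming-scala" % "1.0.0",
  "org.apache.flink" %% "flink-connector-kafka-0.8" % "1.0.0"
  // deliberately not listed: "org.apache.flink" %% "flink-connector-elasticsearch" % "1.0.0",
  // whose jar was observed (below in this thread) to contain shaded asm classes.
)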

-Zach


On Mon, Mar 14, 2016 at 2:03 PM Andrew Whitaker <
andrew.whita...@braintreepayments.com> wrote:

> We're having the same issue (we also have a dependency on
> flink-connector-elasticsearch). It's only happening to us in IntelliJ
> though. Is this the case for you as well?
>
> On Thu, Mar 10, 2016 at 3:20 PM, Zach Cox  wrote:
>
>> After some poking around I noticed
>> that flink-connector-elasticsearch_2.10-1.0.0.jar contains shaded asm
>> classes. If I remove that dependency from my project then I do not get the
>> IllegalArgumentException.
>>
>>
>> On Thu, Mar 10, 2016 at 11:51 AM Zach Cox  wrote:
>>
>>> Here are the jars on the classpath when I try to run our Flink job in a
>>> local environment (via `sbt run`):
>>>
>>>
>>> https://gist.githubusercontent.com/zcox/0992aba1c517b51dc879/raw/7136ec034c2beef04bd65de9f125ce3796db511f/gistfile1.txt
>>>
>>> There are many transitive dependencies pulled in from internal library
>>> projects that probably need to be cleaned out. Maybe we are including
>>> something that conflicts? Or maybe something important is being excluded?
>>>
>>> Are the asm classes included in Flink jars in some shaded form?
>>>
>>> Thanks,
>>> Zach
>>>
>>>
>>> On Thu, Mar 10, 2016 at 5:06 AM Stephan Ewen  wrote:
>>>
 Dependency shading changed a bit between RC4 and RC5 - maybe a
 different minor ASM version is now included in the "test" scope.

 Can you share the dependencies of the problematic project?

 On Thu, Mar 10, 2016 at 12:26 AM, Zach Cox  wrote:

> I also noticed when I try to run this application in a local
> environment, I get the same IllegalArgumentException.
>
> When I assemble this application into a fat jar and run it on a Flink
> cluster using the CLI tools, it seems to run fine.
>
> Maybe my local classpath is missing something that is provided on the
> Flink task managers?
>
> -Zach
>
>
> On Wed, Mar 9, 2016 at 5:16 PM Zach Cox  wrote:
>
>> Hi - after upgrading to 1.0.0, I'm getting this exception now in a
>> unit test:
>>
>>IllegalArgumentException:   (null:-1)
>> org.apache.flink.shaded.org.objectweb.asm.ClassVisitor.<init>(Unknown
>> Source)
>> org.apache.flink.shaded.org.objectweb.asm.ClassVisitor.<init>(Unknown
>> Source)
>>
>> org.apache.flink.api.scala.InnerClosureFinder.<init>(ClosureCleaner.scala:279)
>>
>> org.apache.flink.api.scala.ClosureCleaner$.getInnerClasses(ClosureCleaner.scala:95)
>>
>> org.apache.flink.api.scala.ClosureCleaner$.clean(ClosureCleaner.scala:115)
>>
>> org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.scalaClean(StreamExecutionEnvironment.scala:568)
>>
>> org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.addSource(StreamExecutionEnvironment.scala:498)
>>
>> The line that causes that exception is just adding
>> a FlinkKafkaConsumer08 source.
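
[Editor's sketch of the call that triggers the closure cleaning, and with it the shaded ASM ClassVisitor from the stack trace above. Topic name, properties and schema are assumptions for illustration only.]

import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

val env = StreamExecutionEnvironment.getExecutionEnvironment
val props = new Properties()
props.setProperty("zookeeper.connect", "localhost:2181")
props.setProperty("bootstrap.servers", "localhost:9092")
props.setProperty("group.id", "test-group")

// addSource() runs scalaClean()/ClosureCleaner over the source function,
// which is where the shaded ClassVisitor in the trace above is constructed.
val stream = env.addSource(
  new FlinkKafkaConsumer08[String]("my-topic", new SimpleStringSchema(), props))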
>>
>> ClassVisitor [1] seems to throw that IllegalArgumentException when it
>> is not given a valid api version number, but InnerClosureFinder [2] looks
>> fine to me.
>>
>> Any idea what might be causing this? This unit test worked fine with
>> 1.0.0-rc0 jars.
>>
>> Thanks,
>> Zach
>>
>> [1]
>> http://websvn.ow2.org/filedetails.php?repname=asm=%2Ftrunk%2Fasm%2Fsrc%2Forg%2Fobjectweb%2Fasm%2FClassVisitor.java
>> [2]
>> https://github.com/apache/flink/blob/master/flink-scala/src/main/scala/org/apache/flink/api/scala/ClosureCleaner.scala#L279
>>
>>
>>

>
>
> --
> Andrew Whitaker | andrew.whita...@braintreepayments.com
> --
> Note: this information is confidential. It is prohibited to share, post
> online or otherwise publicize without Braintree's prior written consent.
>


Re: Memory ran out PageRank

2016-03-14 Thread Ovidiu-Cristian MARCU
Correction: the CC I am running successfully is on top of your friend, Spark :)

Best,
Ovidiu
> On 14 Mar 2016, at 20:38, Ovidiu-Cristian MARCU 
>  wrote:
> 
> Yes, largely different. I was expecting the solution set to be spillable.
> This is somehow a very hard limitation; the layout of the data makes the 
> difference.
> 
> By contrast, I am able to run CC successfully on the synthetic data, but RDDs 
> are persisted in memory or on disk.
> 
> Best,
> Ovidiu
> 
>> On 14 Mar 2016, at 18:48, Ufuk Celebi  wrote:
>> 
>> Probably the limitation is that the number of keys is different in the
>> real and the synthetic data set respectively. Can you confirm this?
>> 
>> The solution set for delta iterations is currently implemented as an
>> in-memory hash table that works on managed memory segments, but is not
>> spillable.
>> 
>> – Ufuk
>> 
>> On Mon, Mar 14, 2016 at 6:30 PM, Ovidiu-Cristian MARCU
>>  wrote:
>>> 
>>> This problem is surprising as I was able to run PR and CC on a larger graph 
>>> (2bil edges) but with this synthetic graph (1bil edges groups of 10) I ran 
>>> out of memory; regarding configuration (memory and parallelism, other 
>>> internals) I was using the same.
>>> There is some limitation somewhere I will try to understand more what is 
>>> happening.
>>> 
>>> Best,
>>> Ovidiu
>>> 
 On 14 Mar 2016, at 18:06, Martin Junghanns  wrote:
 
 Hi,
 
 I understand the confusion. So far, I did not run into the problem, but I 
 think this needs to be addressed as all our graph processing abstractions 
 are implemented on top of the delta iteration.
 
 According to the previous mailing list discussion, the problem is with the 
 solution set and its missing ability to spill.
 
 If this is still the case, we should open an issue for that. Any 
 further opinions on that?
 
 Cheers,
 Martin
 
 
 On 14.03.2016 17:55, Ovidiu-Cristian MARCU wrote:
> Thank you for this alternative.
> I don’t understand how the workaround will fix this on systems with 
> limited memory and maybe larger graph.
> 
> Running Connected Components on the same graph gives the same problem.
> 
> IterationHead(Unnamed Delta Iteration)(82/88) switched to FAILED
> java.lang.RuntimeException: Memory ran out. Compaction failed. 
> numPartitions: 32 minPartition: 31 maxPartition: 32 number of overflow 
> segments: 417 bucketSize: 827 Overall memory: 149159936 Partition memory: 
> 65601536 Message: Index: 32, Size: 31
>   at 
> org.apache.flink.runtime.operators.hash.CompactingHashTable.insertRecordIntoPartition(CompactingHashTable.java:469)
>   at 
> org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:414)
>   at 
> org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:325)
>   at 
> org.apache.flink.runtime.iterative.task.IterationHeadTask.readInitialSolutionSet(IterationHeadTask.java:212)
>   at 
> org.apache.flink.runtime.iterative.task.IterationHeadTask.run(IterationHeadTask.java:273)
>   at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:354)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
>   at java.lang.Thread.run(Thread.java:745)
> 
> Best,
> Ovidiu
> 
>> On 14 Mar 2016, at 17:36, Martin Junghanns  
>> wrote:
>> 
>> Hi
>> 
>> I think this is the same issue we had before on the list [1]. Stephan 
>> recommended the following workaround:
>> 
>>> A possible workaround is to use the option 
>>> "setSolutionSetUnmanaged(true)"
>>> on the iteration. That will eliminate the fragmentation issue, at least.
>> 
>> Unfortunately, you cannot set this when using graph.run(new 
>> PageRank(...))
>> 
>> I created a Gist which shows you how to set this using PageRank
>> 
>> https://gist.github.com/s1ck/801a8ef97ce374b358df
>> 
>> Please let us know if it worked out for you.
>> 
>> Cheers,
>> Martin
>> 
>> [1] 
>> http://mail-archives.apache.org/mod_mbox/flink-user/201508.mbox/%3CCAELUF_ByPAB%2BPXWLemPzRH%3D-awATeSz4sGz4v9TmnvFku3%3Dx3A%40mail.gmail.com%3E
>> 
>> On 14.03.2016 16:55, Ovidiu-Cristian MARCU wrote:
>>> Hi,
>>> 
>>> While running PageRank on a synthetic graph I run into this problem:
>>> Any advice on how should I proceed to overcome this memory issue?
>>> 
>>> IterationHead(Vertex-centric iteration 
>>> (org.apache.flink.graph.library.PageRank$VertexRankUpdater@7712cae0 | 
>>> org.apache.flink.graph.library.PageRank$RankMesseng$
>>> java.lang.RuntimeException: Memory ran out. Compaction 

Re: Memory ran out PageRank

2016-03-14 Thread Ovidiu-Cristian MARCU
Yes, largely different. I was expecting the solution set to be spillable.
This is somehow a very hard limitation; the layout of the data makes the 
difference.

By contrast, I am able to run CC successfully on the synthetic data, but RDDs 
are persisted in memory or on disk.

Best,
Ovidiu

> On 14 Mar 2016, at 18:48, Ufuk Celebi  wrote:
> 
> Probably the limitation is that the number of keys is different in the
> real and the synthetic data set respectively. Can you confirm this?
> 
> The solution set for delta iterations is currently implemented as an
> in-memory hash table that works on managed memory segments, but is not
> spillable.
> 
> – Ufuk
> 
> On Mon, Mar 14, 2016 at 6:30 PM, Ovidiu-Cristian MARCU
>  wrote:
>> 
>> This problem is surprising as I was able to run PR and CC on a larger graph 
>> (2bil edges) but with this synthetic graph (1bil edges groups of 10) I ran 
>> out of memory; regarding configuration (memory and parallelism, other 
>> internals) I was using the same.
>> There is some limitation somewhere I will try to understand more what is 
>> happening.
>> 
>> Best,
>> Ovidiu
>> 
>>> On 14 Mar 2016, at 18:06, Martin Junghanns  wrote:
>>> 
>>> Hi,
>>> 
>>> I understand the confusion. So far, I did not run into the problem, but I 
>>> think this needs to be addressed as all our graph processing abstractions 
>>> are implemented on top of the delta iteration.
>>> 
>>> According to the previous mailing list discussion, the problem is with the 
>>> solution set and its missing ability to spill.
>>> 
>>> If this is still the case, we should open an issue for that. Any 
>>> further opinions on that?
>>> 
>>> Cheers,
>>> Martin
>>> 
>>> 
>>> On 14.03.2016 17:55, Ovidiu-Cristian MARCU wrote:
 Thank you for this alternative.
 I don’t understand how the workaround will fix this on systems with 
 limited memory and maybe larger graph.
 
 Running Connected Components on the same graph gives the same problem.
 
 IterationHead(Unnamed Delta Iteration)(82/88) switched to FAILED
 java.lang.RuntimeException: Memory ran out. Compaction failed. 
 numPartitions: 32 minPartition: 31 maxPartition: 32 number of overflow 
 segments: 417 bucketSize: 827 Overall memory: 149159936 Partition memory: 
 65601536 Message: Index: 32, Size: 31
at 
 org.apache.flink.runtime.operators.hash.CompactingHashTable.insertRecordIntoPartition(CompactingHashTable.java:469)
at 
 org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:414)
at 
 org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:325)
at 
 org.apache.flink.runtime.iterative.task.IterationHeadTask.readInitialSolutionSet(IterationHeadTask.java:212)
at 
 org.apache.flink.runtime.iterative.task.IterationHeadTask.run(IterationHeadTask.java:273)
at 
 org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:354)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
at java.lang.Thread.run(Thread.java:745)
 
 Best,
 Ovidiu
 
> On 14 Mar 2016, at 17:36, Martin Junghanns  
> wrote:
> 
> Hi
> 
> I think this is the same issue we had before on the list [1]. Stephan 
> recommended the following workaround:
> 
>> A possible workaround is to use the option 
>> "setSolutionSetUnmanaged(true)"
>> on the iteration. That will eliminate the fragmentation issue, at least.
> 
> Unfortunately, you cannot set this when using graph.run(new PageRank(...))
> 
> I created a Gist which shows you how to set this using PageRank
> 
> https://gist.github.com/s1ck/801a8ef97ce374b358df
> 
> Please let us know if it worked out for you.
> 
> Cheers,
> Martin
> 
> [1] 
> http://mail-archives.apache.org/mod_mbox/flink-user/201508.mbox/%3CCAELUF_ByPAB%2BPXWLemPzRH%3D-awATeSz4sGz4v9TmnvFku3%3Dx3A%40mail.gmail.com%3E
> 
> On 14.03.2016 16:55, Ovidiu-Cristian MARCU wrote:
>> Hi,
>> 
>> While running PageRank on a synthetic graph I run into this problem:
>> Any advice on how should I proceed to overcome this memory issue?
>> 
>> IterationHead(Vertex-centric iteration 
>> (org.apache.flink.graph.library.PageRank$VertexRankUpdater@7712cae0 | 
>> org.apache.flink.graph.library.PageRank$RankMesseng$
>> java.lang.RuntimeException: Memory ran out. Compaction failed. 
>> numPartitions: 32 minPartition: 24 maxPartition: 25 number of overflow 
>> segments: 328 bucketSize: 638 Overall memory: 115539968 Partition 
>> memory: 50659328 Message: Index: 25, Size: 24
>>at 
>> 

Re: asm IllegalArgumentException with 1.0.0

2016-03-14 Thread Andrew Whitaker
We're having the same issue (we also have a dependency on
flink-connector-elasticsearch). It's only happening to us in IntelliJ
though. Is this the case for you as well?

On Thu, Mar 10, 2016 at 3:20 PM, Zach Cox  wrote:

> After some poking around I noticed
> that flink-connector-elasticsearch_2.10-1.0.0.jar contains shaded asm
> classes. If I remove that dependency from my project then I do not get the
> IllegalArgumentException.
>
>
> On Thu, Mar 10, 2016 at 11:51 AM Zach Cox  wrote:
>
>> Here are the jars on the classpath when I try to run our Flink job in a
>> local environment (via `sbt run`):
>>
>>
>> https://gist.githubusercontent.com/zcox/0992aba1c517b51dc879/raw/7136ec034c2beef04bd65de9f125ce3796db511f/gistfile1.txt
>>
>> There are many transitive dependencies pulled in from internal library
>> projects that probably need to be cleaned out. Maybe we are including
>> something that conflicts? Or maybe something important is being excluded?
>>
>> Are the asm classes included in Flink jars in some shaded form?
>>
>> Thanks,
>> Zach
>>
>>
>> On Thu, Mar 10, 2016 at 5:06 AM Stephan Ewen  wrote:
>>
>>> Dependency shading changed a bit between RC4 and RC5 - maybe a different
>>> minor ASM version is now included in the "test" scope.
>>>
>>> Can you share the dependencies of the problematic project?
>>>
>>> On Thu, Mar 10, 2016 at 12:26 AM, Zach Cox  wrote:
>>>
 I also noticed when I try to run this application in a local
 environment, I get the same IllegalArgumentException.

 When I assemble this application into a fat jar and run it on a Flink
 cluster using the CLI tools, it seems to run fine.

 Maybe my local classpath is missing something that is provided on the
 Flink task managers?

 -Zach


 On Wed, Mar 9, 2016 at 5:16 PM Zach Cox  wrote:

> Hi - after upgrading to 1.0.0, I'm getting this exception now in a
> unit test:
>
>IllegalArgumentException:   (null:-1)
> org.apache.flink.shaded.org.objectweb.asm.ClassVisitor.<init>(Unknown
> Source)
> org.apache.flink.shaded.org.objectweb.asm.ClassVisitor.<init>(Unknown
> Source)
>
> org.apache.flink.api.scala.InnerClosureFinder.<init>(ClosureCleaner.scala:279)
>
> org.apache.flink.api.scala.ClosureCleaner$.getInnerClasses(ClosureCleaner.scala:95)
>
> org.apache.flink.api.scala.ClosureCleaner$.clean(ClosureCleaner.scala:115)
>
> org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.scalaClean(StreamExecutionEnvironment.scala:568)
>
> org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.addSource(StreamExecutionEnvironment.scala:498)
>
> The line that causes that exception is just adding
> a FlinkKafkaConsumer08 source.
>
> ClassVisitor [1] seems to throw that IllegalArgumentException when it
> is not given a valid api version number, but InnerClosureFinder [2] looks
> fine to me.
>
> Any idea what might be causing this? This unit test worked fine with
> 1.0.0-rc0 jars.
>
> Thanks,
> Zach
>
> [1]
> http://websvn.ow2.org/filedetails.php?repname=asm=%2Ftrunk%2Fasm%2Fsrc%2Forg%2Fobjectweb%2Fasm%2FClassVisitor.java
> [2]
> https://github.com/apache/flink/blob/master/flink-scala/src/main/scala/org/apache/flink/api/scala/ClosureCleaner.scala#L279
>
>
>
>>>


-- 
Andrew Whitaker | andrew.whita...@braintreepayments.com
--
Note: this information is confidential. It is prohibited to share, post
online or otherwise publicize without Braintree's prior written consent.


Re: Memory ran out PageRank

2016-03-14 Thread Ufuk Celebi
Probably the limitation is that the number of keys is different in the
real and the synthetic data set respectively. Can you confirm this?

The solution set for delta iterations is currently implemented as an
in-memory hash table that works on managed memory segments, but is not
spillable.
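
[Editor's sketch, not from the thread: a minimal DataSet-API delta iteration to show which structure is meant by "solution set". Element values and the step function are made up; the workaround name setSolutionSetUnmanaged(true) quoted later in this thread is the switch that keeps this structure out of managed memory, and how it is exposed differs between the Java and Scala APIs.]

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

// hypothetical (vertexId, rank) pairs; this DataSet becomes the solution set,
// i.e. the CompactingHashTable that shows up in the stack traces in this thread
val initial = env.fromElements((1L, 1.0), (2L, 1.0), (3L, 1.0))

val result = initial.iterateDelta(initial, 10, Array(0)) { (solution, workset) =>
  val delta = workset.map { case (id, rank) => (id, rank * 0.85) } // placeholder step
  (delta, delta) // (solution set delta, next workset)
}

result.print()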

– Ufuk

On Mon, Mar 14, 2016 at 6:30 PM, Ovidiu-Cristian MARCU
 wrote:
>
> This problem is surprising as I was able to run PR and CC on a larger graph 
> (2bil edges) but with this synthetic graph (1bil edges groups of 10) I ran 
> out of memory; regarding configuration (memory and parallelism, other 
> internals) I was using the same.
> There is some limitation somewhere I will try to understand more what is 
> happening.
>
> Best,
> Ovidiu
>
>> On 14 Mar 2016, at 18:06, Martin Junghanns  wrote:
>>
>> Hi,
>>
>> I understand the confusion. So far, I did not run into the problem, but I 
>> think this needs to be addressed as all our graph processing abstractions are 
>> implemented on top of the delta iteration.
>>
>> According to the previous mailing list discussion, the problem is with the 
>> solution set and its missing ability to spill.
>>
>> If this is still the case, we should open an issue for that. Any further 
>> opinions on that?
>>
>> Cheers,
>> Martin
>>
>>
>> On 14.03.2016 17:55, Ovidiu-Cristian MARCU wrote:
>>> Thank you for this alternative.
>>> I don’t understand how the workaround will fix this on systems with limited 
>>> memory and maybe larger graph.
>>>
>>> Running Connected Components on the same graph gives the same problem.
>>>
>>> IterationHead(Unnamed Delta Iteration)(82/88) switched to FAILED
>>> java.lang.RuntimeException: Memory ran out. Compaction failed. 
>>> numPartitions: 32 minPartition: 31 maxPartition: 32 number of overflow 
>>> segments: 417 bucketSize: 827 Overall memory: 149159936 Partition memory: 
>>> 65601536 Message: Index: 32, Size: 31
>>> at 
>>> org.apache.flink.runtime.operators.hash.CompactingHashTable.insertRecordIntoPartition(CompactingHashTable.java:469)
>>> at 
>>> org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:414)
>>> at 
>>> org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:325)
>>> at 
>>> org.apache.flink.runtime.iterative.task.IterationHeadTask.readInitialSolutionSet(IterationHeadTask.java:212)
>>> at 
>>> org.apache.flink.runtime.iterative.task.IterationHeadTask.run(IterationHeadTask.java:273)
>>> at 
>>> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:354)
>>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> Best,
>>> Ovidiu
>>>
 On 14 Mar 2016, at 17:36, Martin Junghanns  wrote:

 Hi

 I think this is the same issue we had before on the list [1]. Stephan 
 recommended the following workaround:

> A possible workaround is to use the option "setSolutionSetUnmanaged(true)"
> on the iteration. That will eliminate the fragmentation issue, at least.

 Unfortunately, you cannot set this when using graph.run(new PageRank(...))

 I created a Gist which shows you how to set this using PageRank

 https://gist.github.com/s1ck/801a8ef97ce374b358df

 Please let us know if it worked out for you.

 Cheers,
 Martin

 [1] 
 http://mail-archives.apache.org/mod_mbox/flink-user/201508.mbox/%3CCAELUF_ByPAB%2BPXWLemPzRH%3D-awATeSz4sGz4v9TmnvFku3%3Dx3A%40mail.gmail.com%3E

 On 14.03.2016 16:55, Ovidiu-Cristian MARCU wrote:
> Hi,
>
> While running PageRank on a synthetic graph I run into this problem:
> Any advice on how should I proceed to overcome this memory issue?
>
> IterationHead(Vertex-centric iteration 
> (org.apache.flink.graph.library.PageRank$VertexRankUpdater@7712cae0 | 
> org.apache.flink.graph.library.PageRank$RankMesseng$
> java.lang.RuntimeException: Memory ran out. Compaction failed. 
> numPartitions: 32 minPartition: 24 maxPartition: 25 number of overflow 
> segments: 328 bucketSize: 638 Overall memory: 115539968 Partition memory: 
> 50659328 Message: Index: 25, Size: 24
> at 
> org.apache.flink.runtime.operators.hash.CompactingHashTable.insertRecordIntoPartition(CompactingHashTable.java:469)
> at 
> org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:414)
> at 
> org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:325)
> at 
> org.apache.flink.runtime.iterative.task.IterationHeadTask.readInitialSolutionSet(IterationHeadTask.java:212)
> at 
> 

Re: Memory ran out PageRank

2016-03-14 Thread Ovidiu-Cristian MARCU

This problem is surprising, as I was able to run PR and CC on a larger graph 
(2bil edges), but with this synthetic graph (1bil edges, groups of 10) I ran out 
of memory; regarding configuration (memory and parallelism, other internals) I 
was using the same settings.
There is some limitation somewhere; I will try to understand more about what is 
happening.

Best,
Ovidiu

> On 14 Mar 2016, at 18:06, Martin Junghanns  wrote:
> 
> Hi,
> 
> I understand the confusion. So far, I did not run into the problem, but I 
> think this needs to be addressed as all our graph processing abstractions are 
> implemented on top of the delta iteration.
> 
> According to the previous mailing list discussion, the problem is with the 
> solution set and its missing ability to spill.
> 
> If this is still the case, we should open an issue for that. Any further 
> opinions on that?
> 
> Cheers,
> Martin
> 
> 
> On 14.03.2016 17:55, Ovidiu-Cristian MARCU wrote:
>> Thank you for this alternative.
>> I don’t understand how the workaround will fix this on systems with limited 
>> memory and maybe larger graph.
>> 
>> Running Connected Components on the same graph gives the same problem.
>> 
>> IterationHead(Unnamed Delta Iteration)(82/88) switched to FAILED
>> java.lang.RuntimeException: Memory ran out. Compaction failed. 
>> numPartitions: 32 minPartition: 31 maxPartition: 32 number of overflow 
>> segments: 417 bucketSize: 827 Overall memory: 149159936 Partition memory: 
>> 65601536 Message: Index: 32, Size: 31
>> at 
>> org.apache.flink.runtime.operators.hash.CompactingHashTable.insertRecordIntoPartition(CompactingHashTable.java:469)
>> at 
>> org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:414)
>> at 
>> org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:325)
>> at 
>> org.apache.flink.runtime.iterative.task.IterationHeadTask.readInitialSolutionSet(IterationHeadTask.java:212)
>> at 
>> org.apache.flink.runtime.iterative.task.IterationHeadTask.run(IterationHeadTask.java:273)
>> at 
>> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:354)
>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
>> at java.lang.Thread.run(Thread.java:745)
>> 
>> Best,
>> Ovidiu
>> 
>>> On 14 Mar 2016, at 17:36, Martin Junghanns  wrote:
>>> 
>>> Hi
>>> 
>>> I think this is the same issue we had before on the list [1]. Stephan 
>>> recommended the following workaround:
>>> 
 A possible workaround is to use the option "setSolutionSetUnmanaged(true)"
 on the iteration. That will eliminate the fragmentation issue, at least.
>>> 
>>> Unfortunately, you cannot set this when using graph.run(new PageRank(...))
>>> 
>>> I created a Gist which shows you how to set this using PageRank
>>> 
>>> https://gist.github.com/s1ck/801a8ef97ce374b358df
>>> 
>>> Please let us know if it worked out for you.
>>> 
>>> Cheers,
>>> Martin
>>> 
>>> [1] 
>>> http://mail-archives.apache.org/mod_mbox/flink-user/201508.mbox/%3CCAELUF_ByPAB%2BPXWLemPzRH%3D-awATeSz4sGz4v9TmnvFku3%3Dx3A%40mail.gmail.com%3E
>>> 
>>> On 14.03.2016 16:55, Ovidiu-Cristian MARCU wrote:
 Hi,
 
 While running PageRank on a synthetic graph I run into this problem:
 Any advice on how should I proceed to overcome this memory issue?
 
 IterationHead(Vertex-centric iteration 
 (org.apache.flink.graph.library.PageRank$VertexRankUpdater@7712cae0 | 
 org.apache.flink.graph.library.PageRank$RankMesseng$
 java.lang.RuntimeException: Memory ran out. Compaction failed. 
 numPartitions: 32 minPartition: 24 maxPartition: 25 number of overflow 
 segments: 328 bucketSize: 638 Overall memory: 115539968 Partition memory: 
 50659328 Message: Index: 25, Size: 24
 at 
 org.apache.flink.runtime.operators.hash.CompactingHashTable.insertRecordIntoPartition(CompactingHashTable.java:469)
 at 
 org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:414)
 at 
 org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:325)
 at 
 org.apache.flink.runtime.iterative.task.IterationHeadTask.readInitialSolutionSet(IterationHeadTask.java:212)
 at 
 org.apache.flink.runtime.iterative.task.IterationHeadTask.run(IterationHeadTask.java:273)
 at 
 org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:354)
 at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
 at java.lang.Thread.run(Thread.java:745)
 
 Thanks!
 
 Best,
 Ovidiu
 
>> 
>> 



Re: Behavior of SlidingProessingTimeWindow with CountTrigger

2016-03-14 Thread Vishnu Viswanath
Hi Aljoscha,

Thank you for the explanation and the link on IBM InfoSphere. That explains
why I am seeing (a,3) and (b,3) in my example.

Yes, the name Evictor is confusing.

Thanks and Regards,
Vishnu Viswanath,
www.vishnuviswanath.com

On Mon, Mar 14, 2016 at 11:24 AM, Aljoscha Krettek 
wrote:

Hi,
> sure, the evictors are a bit confusing (especially the fact that they are
> called evictors). They should more correctly be called “Keepers”. The process
> is the following:
>
> 1. Trigger Fires
> 2. Evictor decides what elements to keep, so a CountEvictor.of(3) says,
> keep only three elements, all others are evicted
> 3. Elements that remain after evictor are used for processing
>
> We mostly have Evictors for legacy reasons nowadays since the original
> window implementation was based on ideas in IBM InfoSphere streams. See
> this part of their documentation for some explanation:
> https://www.ibm.com/support/knowledgecenter/#!/SSCRJU_4.0.0/com.ibm.streams.dev.doc/doc/windowhandling.html
>
> - aljoscha
> > On 14 Mar 2016, at 17:04, Vishnu Viswanath 
> wrote:
> >
> > Hi Aijoscha,
> >
> > Wow, great illustration.
> >
> > That was a very clear explanation. Yes, I did enter the elements fast for
> case b and I was seeing more of case A.
> > Also, sometimes I have seen a window getting triggered when I enter 1 or
> 2 elements; I believe that is an expansion of case A, w.r.t. window 2.
> >
> > Also can you explain me the case when using Evictor.
> > e.g.,
> >
> >
> > val counts = socTextStream.flatMap{_.split("\\s")}
> >   .map { (_, 1) }
> >   .keyBy(0)
> >
>  .window(SlidingProcessingTimeWindows.of(Time.seconds(20),Time.seconds(10)))
> >   .trigger(CountTrigger.of(5))
> >   .evictor(CountEvictor.of(3))
> >   .sum(1).setParallelism(4);
> >
> > counts.print()
> > sev.execute()
> >
> > for the input
> >
> >
> > a
> >
> > a
> >
> > a
> >
> > a
> >
> > a
> >
> > b
> >
> > b
> >
> > b
> >
> > b
> >
> > b
> >
> > I got the output as
> >
> >
> > 1> (a,3)
> >
> > 1> (b,3)
> >
> > 2> (b,3)
> >
> > My assumption was that, when the Trigger is triggered, the processing
> will be done on the entire items in the window,
> >
> > and then 3 items will be evicted from the window, which can also be part
> of the next processing of that window. But
> >
> > here it looks like  the sum is calculated only on the items that were
> evicted from the window.
> >
> > Could you please explain what is going on here.
> >
> >
> >
> > Thanks and Regards,
> > Vishnu Viswanath,
> > www.vishnuviswanath.com
> >
> > On Mon, Mar 14, 2016 at 5:27 AM, Aljoscha Krettek 
> wrote:
> > Hi,
> > I created a visualization to help explain the situation:
> http://s21.postimg.org/dofhcw52f/window_example.png
> > 
> > The SlidingProcessingTimeWindows assigner assigns elements to windows
> based on the current processing time. The CountTrigger only fires if a
> window contains 5 elements (or more). In your test the windows for a, c and
> e fell into case b because you probably entered the letters very fast. For
> elements  b and d we have case a The elements were far enough apart or you
> happened to enter them right on a window boundary such that only one window
> contains all of them. The other windows don’t contain enough elements to
> reach 5. In my drawing window 1 contains 5 elements while window 2 only
> contains 3 of those elements.
> >
> > I hope this helps.
> >
> > Cheers,
> > Aljoscha
> >> On 12 Mar 2016, at 19:19, Vishnu Viswanath <
> vishnu.viswanat...@gmail.com> wrote:
> >>
> >> Hi All,
> >>
> >>
> >> I have the below code
> >>
> >>
> >> val sev = StreamExecutionEnvironment.getExecutionEnvironment
> >> val socTextStream = sev.socketTextStream("localhost",)
> >>
> >> val counts = socTextStream.flatMap{_.split("\\s")}
> >>   .map { (_, 1) }
> >>   .keyBy(0)
> >>
>  .window(SlidingProcessingTimeWindows.of(Time.seconds(20),Time.seconds(10)))
> >>   .trigger(CountTrigger.of(5))
> >>   .sum(1)
> >>
> >> counts.print()
> >> sev.execute()
> >>
> >> I am sending messages to the port  using nc -lk 
> >> This is my sample input
> >>
> >> a
> >> a
> >> a
> >> a
> >> a
> >> b
> >> b
> >> b
> >> b
> >> b
> >> c
> >> c
> >> c
> >> c
> >> c
> >> d
> >> d
> >> d
> >> d
> >> d
> >> e
> >> e
> >> e
> >> e
> >> e
> >>
> >> I am sending 5 of each letter since I have a Count Trigger of 5. I was
> expecting that for each 5 character, the code will print 5, i.e., (a,5)
> (b,5) etc. But the output I am getting is little confusing.
> >> Output:
> >>
> >> 1> (a,5)
> >> 1> (a,5)
> >> 1> (b,5)
> >> 2> (c,5)
> >> 2> (c,5)
> >> 1> (d,5)
> >> 1> (e,5)
> >> 1> (e,5)
> >>
> >> As you can see, for some character the count is printed twice(a,c,e)
> and for some characters it is printed only once (b,d). I am not able to
> figure out what is going on. I think it may have something to do with the
> SlidingProcessingTimeWindow but I am not sure.
> >> Can someone explain me what is going on?
> >>
> >>
> >> 

Re: Memory ran out PageRank

2016-03-14 Thread Martin Junghanns

Hi,

I understand the confusion. So far, I did not run into the problem, but 
I think this needs to be addressed as all our graph processing 
abstractions are implemented on top of the delta iteration.


According to the previous mailing list discussion, the problem is with 
the solution set and its missing ability to spill.


If this is still the case, we should open an issue for that. Any 
further opinions on that?


Cheers,
Martin


On 14.03.2016 17:55, Ovidiu-Cristian MARCU wrote:

Thank you for this alternative.
I don’t understand how the workaround will fix this on systems with limited 
memory and maybe larger graph.

Running Connected Components on the same graph gives the same problem.

IterationHead(Unnamed Delta Iteration)(82/88) switched to FAILED
java.lang.RuntimeException: Memory ran out. Compaction failed. numPartitions: 
32 minPartition: 31 maxPartition: 32 number of overflow segments: 417 
bucketSize: 827 Overall memory: 149159936 Partition memory: 65601536 Message: 
Index: 32, Size: 31
 at 
org.apache.flink.runtime.operators.hash.CompactingHashTable.insertRecordIntoPartition(CompactingHashTable.java:469)
 at 
org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:414)
 at 
org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:325)
 at 
org.apache.flink.runtime.iterative.task.IterationHeadTask.readInitialSolutionSet(IterationHeadTask.java:212)
 at 
org.apache.flink.runtime.iterative.task.IterationHeadTask.run(IterationHeadTask.java:273)
 at 
org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:354)
 at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
 at java.lang.Thread.run(Thread.java:745)

Best,
Ovidiu


On 14 Mar 2016, at 17:36, Martin Junghanns  wrote:

Hi

I think this is the same issue we had before on the list [1]. Stephan 
recommended the following workaround:


A possible workaround is to use the option "setSolutionSetUnmanaged(true)"
on the iteration. That will eliminate the fragmentation issue, at least.


Unfortunately, you cannot set this when using graph.run(new PageRank(...))

I created a Gist which shows you how to set this using PageRank

https://gist.github.com/s1ck/801a8ef97ce374b358df

Please let us know if it worked out for you.

Cheers,
Martin

[1] 
http://mail-archives.apache.org/mod_mbox/flink-user/201508.mbox/%3CCAELUF_ByPAB%2BPXWLemPzRH%3D-awATeSz4sGz4v9TmnvFku3%3Dx3A%40mail.gmail.com%3E

On 14.03.2016 16:55, Ovidiu-Cristian MARCU wrote:

Hi,

While running PageRank on a synthetic graph I run into this problem:
Any advice on how should I proceed to overcome this memory issue?

IterationHead(Vertex-centric iteration 
(org.apache.flink.graph.library.PageRank$VertexRankUpdater@7712cae0 | 
org.apache.flink.graph.library.PageRank$RankMesseng$
java.lang.RuntimeException: Memory ran out. Compaction failed. numPartitions: 
32 minPartition: 24 maxPartition: 25 number of overflow segments: 328 
bucketSize: 638 Overall memory: 115539968 Partition memory: 50659328 Message: 
Index: 25, Size: 24
 at 
org.apache.flink.runtime.operators.hash.CompactingHashTable.insertRecordIntoPartition(CompactingHashTable.java:469)
 at 
org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:414)
 at 
org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:325)
 at 
org.apache.flink.runtime.iterative.task.IterationHeadTask.readInitialSolutionSet(IterationHeadTask.java:212)
 at 
org.apache.flink.runtime.iterative.task.IterationHeadTask.run(IterationHeadTask.java:273)
 at 
org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:354)
 at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
 at java.lang.Thread.run(Thread.java:745)

Thanks!

Best,
Ovidiu






Re: Memory ran out PageRank

2016-03-14 Thread Vasiliki Kalavri
Hi Ovidiu,

this option won't fix the problem if your system doesn't have enough memory
:)
It only defines whether the solution set is kept in managed memory or not.
For more iteration configuration options, take a look at the Gelly
documentation [1].

-Vasia.

[1]:
https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/gelly.html#configuring-a-scatter-gather-iteration

On 14 March 2016 at 17:55, Ovidiu-Cristian MARCU <
ovidiu-cristian.ma...@inria.fr> wrote:

> Thank you for this alternative.
> I don’t understand how the workaround will fix this on systems with
> limited memory and maybe larger graph.
>
> Running Connected Components on the same graph gives the same problem.
>
> IterationHead(Unnamed Delta Iteration)(82/88) switched to FAILED
> java.lang.RuntimeException: Memory ran out. Compaction failed.
> numPartitions: 32 minPartition: 31 maxPartition: 32 number of overflow
> segments: 417 bucketSize: 827 Overall memory: 149159936 Partition memory:
> 65601536 Message: Index: 32, Size: 31
> at
> org.apache.flink.runtime.operators.hash.CompactingHashTable.insertRecordIntoPartition(CompactingHashTable.java:469)
> at
> org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:414)
> at
> org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:325)
> at
> org.apache.flink.runtime.iterative.task.IterationHeadTask.readInitialSolutionSet(IterationHeadTask.java:212)
> at
> org.apache.flink.runtime.iterative.task.IterationHeadTask.run(IterationHeadTask.java:273)
> at
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:354)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
> at java.lang.Thread.run(Thread.java:745)
>
> Best,
> Ovidiu
>
> On 14 Mar 2016, at 17:36, Martin Junghanns 
> wrote:
>
> Hi
>
> I think this is the same issue we had before on the list [1]. Stephan
> recommended the following workaround:
>
> A possible workaround is to use the option "setSolutionSetUnmanaged(true)"
> on the iteration. That will eliminate the fragmentation issue, at least.
>
>
> Unfortunately, you cannot set this when using graph.run(new PageRank(...))
>
> I created a Gist which shows you how to set this using PageRank
>
> https://gist.github.com/s1ck/801a8ef97ce374b358df
>
> Please let us know if it worked out for you.
>
> Cheers,
> Martin
>
> [1]
> http://mail-archives.apache.org/mod_mbox/flink-user/201508.mbox/%3CCAELUF_ByPAB%2BPXWLemPzRH%3D-awATeSz4sGz4v9TmnvFku3%3Dx3A%40mail.gmail.com%3E
>
> On 14.03.2016 16:55, Ovidiu-Cristian MARCU wrote:
>
> Hi,
>
> While running PageRank on a synthetic graph I run into this problem:
> Any advice on how should I proceed to overcome this memory issue?
>
> IterationHead(Vertex-centric iteration
> (org.apache.flink.graph.library.PageRank$VertexRankUpdater@7712cae0 |
> org.apache.flink.graph.library.PageRank$RankMesseng$
> java.lang.RuntimeException: Memory ran out. Compaction failed.
> numPartitions: 32 minPartition: 24 maxPartition: 25 number of overflow
> segments: 328 bucketSize: 638 Overall memory: 115539968 Partition memory:
> 50659328 Message: Index: 25, Size: 24
> at
> org.apache.flink.runtime.operators.hash.CompactingHashTable.insertRecordIntoPartition(CompactingHashTable.java:469)
> at
> org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:414)
> at
> org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:325)
> at
> org.apache.flink.runtime.iterative.task.IterationHeadTask.readInitialSolutionSet(IterationHeadTask.java:212)
> at
> org.apache.flink.runtime.iterative.task.IterationHeadTask.run(IterationHeadTask.java:273)
> at
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:354)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
> at java.lang.Thread.run(Thread.java:745)
>
> Thanks!
>
> Best,
> Ovidiu
>
>
>


Re: Memory ran out PageRank

2016-03-14 Thread Ovidiu-Cristian MARCU
Thank you for this alternative.
I don’t understand how the workaround will fix this on systems with limited 
memory and maybe a larger graph.

Running Connected Components on the same graph gives the same problem.

IterationHead(Unnamed Delta Iteration)(82/88) switched to FAILED
java.lang.RuntimeException: Memory ran out. Compaction failed. numPartitions: 
32 minPartition: 31 maxPartition: 32 number of overflow segments: 417 
bucketSize: 827 Overall memory: 149159936 Partition memory: 65601536 Message: 
Index: 32, Size: 31
at 
org.apache.flink.runtime.operators.hash.CompactingHashTable.insertRecordIntoPartition(CompactingHashTable.java:469)
at 
org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:414)
at 
org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:325)
at 
org.apache.flink.runtime.iterative.task.IterationHeadTask.readInitialSolutionSet(IterationHeadTask.java:212)
at 
org.apache.flink.runtime.iterative.task.IterationHeadTask.run(IterationHeadTask.java:273)
at 
org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:354)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
at java.lang.Thread.run(Thread.java:745)

Best,
Ovidiu

> On 14 Mar 2016, at 17:36, Martin Junghanns  wrote:
> 
> Hi
> 
> I think this is the same issue we had before on the list [1]. Stephan 
> recommended the following workaround:
> 
>> A possible workaround is to use the option "setSolutionSetUnmanaged(true)"
>> on the iteration. That will eliminate the fragmentation issue, at least.
> 
> Unfortunately, you cannot set this when using graph.run(new PageRank(...))
> 
> I created a Gist which shows you how to set this using PageRank
> 
> https://gist.github.com/s1ck/801a8ef97ce374b358df
> 
> Please let us know if it worked out for you.
> 
> Cheers,
> Martin
> 
> [1] 
> http://mail-archives.apache.org/mod_mbox/flink-user/201508.mbox/%3CCAELUF_ByPAB%2BPXWLemPzRH%3D-awATeSz4sGz4v9TmnvFku3%3Dx3A%40mail.gmail.com%3E
> 
> On 14.03.2016 16:55, Ovidiu-Cristian MARCU wrote:
>> Hi,
>> 
>> While running PageRank on a synthetic graph I run into this problem:
>> Any advice on how should I proceed to overcome this memory issue?
>> 
>> IterationHead(Vertex-centric iteration 
>> (org.apache.flink.graph.library.PageRank$VertexRankUpdater@7712cae0 | 
>> org.apache.flink.graph.library.PageRank$RankMesseng$
>> java.lang.RuntimeException: Memory ran out. Compaction failed. 
>> numPartitions: 32 minPartition: 24 maxPartition: 25 number of overflow 
>> segments: 328 bucketSize: 638 Overall memory: 115539968 Partition memory: 
>> 50659328 Message: Index: 25, Size: 24
>> at 
>> org.apache.flink.runtime.operators.hash.CompactingHashTable.insertRecordIntoPartition(CompactingHashTable.java:469)
>> at 
>> org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:414)
>> at 
>> org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:325)
>> at 
>> org.apache.flink.runtime.iterative.task.IterationHeadTask.readInitialSolutionSet(IterationHeadTask.java:212)
>> at 
>> org.apache.flink.runtime.iterative.task.IterationHeadTask.run(IterationHeadTask.java:273)
>> at 
>> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:354)
>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
>> at java.lang.Thread.run(Thread.java:745)
>> 
>> Thanks!
>> 
>> Best,
>> Ovidiu
>> 



Re: Memory ran out PageRank

2016-03-14 Thread Martin Junghanns

Hi

I think this is the same issue we had before on the list [1]. Stephan 
recommended the following workaround:



A possible workaround is to use the option "setSolutionSetUnmanaged(true)"
on the iteration. That will eliminate the fragmentation issue, at least.


Unfortunately, you cannot set this when using graph.run(new PageRank(...))

I created a Gist which shows you how to set this using PageRank

https://gist.github.com/s1ck/801a8ef97ce374b358df

Please let us know if it worked out for you.

Cheers,
Martin

[1] 
http://mail-archives.apache.org/mod_mbox/flink-user/201508.mbox/%3CCAELUF_ByPAB%2BPXWLemPzRH%3D-awATeSz4sGz4v9TmnvFku3%3Dx3A%40mail.gmail.com%3E
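
[Editor's sketch of the kind of configuration the Gist linked above sets up. Class and method names follow the Gelly documentation of that time and should be treated as assumptions for the reader's exact Flink version; this is an illustration, not the library PageRank's own code.]

import org.apache.flink.graph.spargel.ScatterGatherConfiguration

val parameters = new ScatterGatherConfiguration
parameters.setName("PageRank with unmanaged solution set") // cosmetic
parameters.setParallelism(16)                              // hypothetical value
// the option discussed in this thread: keep the delta-iteration solution set
// on the heap instead of in Flink's managed memory
parameters.setSolutionSetUnmanagedMemory(true)

// graph.run(new PageRank(...)) offers no way to pass `parameters`, so the
// scatter-gather iteration has to be invoked directly with this configuration,
// as the Gist above does.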


On 14.03.2016 16:55, Ovidiu-Cristian MARCU wrote:

Hi,

While running PageRank on a synthetic graph I run into this problem:
Any advice on how should I proceed to overcome this memory issue?

IterationHead(Vertex-centric iteration 
(org.apache.flink.graph.library.PageRank$VertexRankUpdater@7712cae0 | 
org.apache.flink.graph.library.PageRank$RankMesseng$
java.lang.RuntimeException: Memory ran out. Compaction failed. numPartitions: 
32 minPartition: 24 maxPartition: 25 number of overflow segments: 328 
bucketSize: 638 Overall memory: 115539968 Partition memory: 50659328 Message: 
Index: 25, Size: 24
 at 
org.apache.flink.runtime.operators.hash.CompactingHashTable.insertRecordIntoPartition(CompactingHashTable.java:469)
 at 
org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:414)
 at 
org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:325)
 at 
org.apache.flink.runtime.iterative.task.IterationHeadTask.readInitialSolutionSet(IterationHeadTask.java:212)
 at 
org.apache.flink.runtime.iterative.task.IterationHeadTask.run(IterationHeadTask.java:273)
 at 
org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:354)
 at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
 at java.lang.Thread.run(Thread.java:745)

Thanks!

Best,
Ovidiu



Re: Integration Alluxio and Flink

2016-03-14 Thread Till Rohrmann
Hi Andrea,

the problem won’t be netty-all but netty, I suspect. Flink is using version
3.8 whereas alluxio-core-client uses version 3.2.2. I think you have to
exclude or shade this dependency away.
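
[Editor's sketch of what such an exclusion might look like in an sbt build. Group/artifact IDs and versions are assumptions; netty 3.x was published under both org.jboss.netty and io.netty at different times, so check the actual dependency tree (e.g. mvn dependency:tree) before copying this.]

// build.sbt (sketch): pull in the Alluxio client but keep its old netty 3.x
// off the classpath so that Flink's own netty version wins
libraryDependencies += ("org.alluxio" % "alluxio-core-client" % "1.0.0")
  .exclude("org.jboss.netty", "netty")
  .exclude("io.netty", "netty")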

Cheers,
Till
​

On Mon, Mar 14, 2016 at 5:12 PM, Andrea Sella 
wrote:

> Hi Till,
> I tried to downgrade the Alluxio's netty version from 4.0.28.Final to
> 4.0.27.Final to align Flink and Alluxio dependencies. First of all, Flink
> 1.0.0 uses 4.0.27.Final, is it correct? Btw it doesn't work, same error as
> above.
>
> BR,
> Andrea
>
> 2016-03-14 15:30 GMT+01:00 Till Rohrmann :
>
>> Yes it seems as if you have a netty version conflict. Maybe the
>> alluxio-core-client.jar pulls in an incompatible netty version. Could you
>> check whether this is the case? But maybe you also have another
>> dependencies which pulls in a wrong netty version, since the Alluxio
>> documentation indicates that it works with Flink (but I cannot tell for
>> which version).
>>
>> Cheers,
>> Till
>>
>> On Mon, Mar 14, 2016 at 3:18 PM, Andrea Sella > > wrote:
>>
>>> Hi to all,
>>>
>>> I'm trying to integrate Alluxio and Apache Flink, I followed Running
>>> Flink on Alluxio
>>>  to
>>> setup Flink.
>>>
>>> I tested in local mode executing:
>>>
>>> bin/flink run ./examples/batch/WordCount.jar --input
>>> alluxio:///flink/README.txt
>>>
>>> But I've faced a TimeoutException, I attach the logs. It seems the
>>> trouble is due to netty dependencies-conflict.
>>>
>>> Thank you,
>>> Andrea
>>>
>>>
>>>
>>
>


Re: Behavior of SlidingProessingTimeWindow with CountTrigger

2016-03-14 Thread Aljoscha Krettek
Hi,
sure, the evictors are a bit confusing (especially the fact that they are 
called evictors). They should more correctly be called “Keepers”. The process is 
the following:

1. Trigger Fires
2. Evictor decides what elements to keep, so a CountEvictor.of(3) says, keep 
only three elements, all others are evicted
3. Elements that remain after evictor are used for processing
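
[Editor's worked example tying these three steps to the (a,3) output discussed in this thread; the pane contents are assumed for illustration.]

// What the pipeline with CountTrigger.of(5) and CountEvictor.of(3) effectively
// does for one window pane, sketched with plain Scala collections:
val pane = Seq(("a", 1), ("a", 1), ("a", 1), ("a", 1), ("a", 1)) // 1. trigger fires at 5 elements
val kept = pane.takeRight(3)                                     // 2. evictor keeps only 3, the rest are evicted
val emitted = kept.reduce((l, r) => (l._1, l._2 + r._2))         // 3. sum(1) runs over the kept elements
// emitted == ("a", 3)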

We mostly have Evictors for legacy reasons nowadays since the original window 
implementation was based on ideas in IBM InfoSphere streams. See this part of 
their documentation for some explanation: 
https://www.ibm.com/support/knowledgecenter/#!/SSCRJU_4.0.0/com.ibm.streams.dev.doc/doc/windowhandling.html

- aljoscha
> On 14 Mar 2016, at 17:04, Vishnu Viswanath  
> wrote:
> 
> Hi Aijoscha,
> 
> Wow, great illustration.
> 
> That was a very clear explanation. Yes, I did enter the elements fast for case 
> b and I was seeing more of case A.
> Also, sometimes I have seen a window getting triggered when I enter 1 or 2 
> elements; I believe that is an expansion of case A, w.r.t. window 2.
> 
> Also can you explain me the case when using Evictor.
> e.g.,
> 
> 
> val counts = socTextStream.flatMap{_.split("\\s")}
>   .map { (_, 1) }
>   .keyBy(0)
>   .window(SlidingProcessingTimeWindows.of(Time.seconds(20),Time.seconds(10)))
>   .trigger(CountTrigger.of(5))
>   .evictor(CountEvictor.of(3))
>   .sum(1).setParallelism(4);
> 
> counts.print()
> sev.execute()
> 
> for the input
> 
> 
> a
> 
> a
> 
> a
> 
> a
> 
> a
> 
> b
> 
> b
> 
> b
> 
> b
> 
> b
> 
> I got the output as
> 
> 
> 1> (a,3)
> 
> 1> (b,3)
> 
> 2> (b,3)
> 
> My assumption was that, when the Trigger is triggered, the processing will be 
> done on the entire items in the window,
> 
> and then 3 items will be evicted from the window, which can also be part of 
> the next processing of that window. But
> 
> here it looks like  the sum is calculated only on the items that were evicted 
> from the window.
> 
> Could you please explain what is going on here.
> 
> 
> 
> Thanks and Regards,
> Vishnu Viswanath,
> www.vishnuviswanath.com
> 
> On Mon, Mar 14, 2016 at 5:27 AM, Aljoscha Krettek  wrote:
> Hi,
> I created a visualization to help explain the situation: 
> http://s21.postimg.org/dofhcw52f/window_example.png
> 
> The SlidingProcessingTimeWindows assigner assigns elements to windows based 
> on the current processing time. The CountTrigger only fires if a window 
> contains 5 elements (or more). In your test the windows for a, c and e fell 
> into case b because you probably entered the letters very fast. For elements  
> b and d we have case a The elements were far enough apart or you happened to 
> enter them right on a window boundary such that only one window contains all 
> of them. The other windows don’t contain enough elements to reach 5. In my 
> drawing window 1 contains 5 elements while window 2 only contains 3 of those 
> elements.
> 
> I hope this helps.
> 
> Cheers,
> Aljoscha
>> On 12 Mar 2016, at 19:19, Vishnu Viswanath  
>> wrote:
>> 
>> Hi All,
>> 
>> 
>> I have the below code
>> 
>> 
>> val sev = StreamExecutionEnvironment.getExecutionEnvironment
>> val socTextStream = sev.socketTextStream("localhost",)
>> 
>> val counts = socTextStream.flatMap{_.split("\\s")}
>>   .map { (_, 1) }
>>   .keyBy(0)
>>   .window(SlidingProcessingTimeWindows.of(Time.seconds(20),Time.seconds(10)))
>>   .trigger(CountTrigger.of(5))
>>   .sum(1)
>> 
>> counts.print()
>> sev.execute()
>> 
>> I am sending messages to the port  using nc -lk 
>> This is my sample input
>> 
>> a
>> a
>> a
>> a
>> a
>> b
>> b
>> b
>> b
>> b
>> c
>> c
>> c
>> c
>> c
>> d
>> d
>> d
>> d
>> d
>> e
>> e
>> e
>> e
>> e
>> 
>> I am sending 5 of each letter since I have a Count Trigger of 5. I was 
>> expecting that for each 5 character, the code will print 5, i.e., (a,5) 
>> (b,5) etc. But the output I am getting is little confusing.
>> Output:
>> 
>> 1> (a,5)
>> 1> (a,5)
>> 1> (b,5)
>> 2> (c,5)
>> 2> (c,5)
>> 1> (d,5)
>> 1> (e,5)
>> 1> (e,5)
>> 
>> As you can see, for some character the count is printed twice(a,c,e) and for 
>> some characters it is printed only once (b,d). I am not able to figure out 
>> what is going on. I think it may have something to do with the 
>> SlidingProcessingTimeWindow but I am not sure.
>> Can someone explain me what is going on?
>> 
>> 
>> Thanks and Regards,
>> Vishnu Viswanath
>> www.vishnuviswanath.com
>> 
> 
> 



Re: Integration Alluxio and Flink

2016-03-14 Thread Andrea Sella
Hi Till,
I tried to downgrade Alluxio's netty version from 4.0.28.Final to
4.0.27.Final to align the Flink and Alluxio dependencies. First of all, Flink
1.0.0 uses 4.0.27.Final, is that correct? Btw, it doesn't work, same error as
above.

BR,
Andrea

2016-03-14 15:30 GMT+01:00 Till Rohrmann :

> Yes it seems as if you have a netty version conflict. Maybe the
> alluxio-core-client.jar pulls in an incompatible netty version. Could you
> check whether this is the case? But maybe you also have another
> dependencies which pulls in a wrong netty version, since the Alluxio
> documentation indicates that it works with Flink (but I cannot tell for
> which version).
>
> Cheers,
> Till
>
> On Mon, Mar 14, 2016 at 3:18 PM, Andrea Sella 
> wrote:
>
>> Hi to all,
>>
>> I'm trying to integrate Alluxio and Apache Flink, I followed Running
>> Flink on Alluxio
>>  to
>> setup Flink.
>>
>> I tested in local mode executing:
>>
>> bin/flink run ./examples/batch/WordCount.jar --input
>> alluxio:///flink/README.txt
>>
>> But I've faced a TimeoutException, I attach the logs. It seems the
>> trouble is due to netty dependencies-conflict.
>>
>> Thank you,
>> Andrea
>>
>>
>>
>


Re: Behavior of SlidingProessingTimeWindow with CountTrigger

2016-03-14 Thread Vishnu Viswanath
Hi Aijoscha,

Wow, great illustration.

That was a very clear explanation. Yes, I did enter the elements fast for
case b and I was seeing more of case A.
Also, sometimes I have seen a window getting triggered when I enter 1 or 2
elements; I believe that is an expansion of case A, w.r.t. window 2.

Also can you explain me the case when using Evictor.
e.g.,


val counts = socTextStream.flatMap{_.split("\\s")}
  .map { (_, 1) }
  .keyBy(0)
  .window(SlidingProcessingTimeWindows.of(Time.seconds(20),Time.seconds(10)))
  .trigger(CountTrigger.of(5))
  .evictor(CountEvictor.of(3))
  .sum(1).setParallelism(4);

counts.print()
sev.execute()

for the input


a

a

a

a

a

b

b

b

b

b

I got the output as


1> (a,3)

1> (b,3)

2> (b,3)

My assumption was that, when the Trigger is triggered, the processing will
be done on all the items in the window,

and then 3 items will be evicted from the window, which can also be part of
the next processing of that window. But

here it looks like the sum is calculated only on the items that were
evicted from the window.

Could you please explain what is going on here.


Thanks and Regards,
Vishnu Viswanath,
www.vishnuviswanath.com

On Mon, Mar 14, 2016 at 5:27 AM, Aljoscha Krettek 
wrote:

> Hi,
> I created a visualization to help explain the situation:
> http://s21.postimg.org/dofhcw52f/window_example.png
> The SlidingProcessingTimeWindows assigner assigns elements to windows
> based on the current processing time. The CountTrigger only fires if a
> window contains 5 elements (or more). In your test the windows for a, c and
> e fell into case b because you probably entered the letters very fast. For
> elements  b and d we have case a The elements were far enough apart or you
> happened to enter them right on a window boundary such that only one window
> contains all of them. The other windows don’t contain enough elements to
> reach 5. In my drawing window 1 contains 5 elements while window 2 only
> contains 3 of those elements.
>
> I hope this helps.
>
> Cheers,
> Aljoscha
>
> On 12 Mar 2016, at 19:19, Vishnu Viswanath 
> wrote:
>
> Hi All,
>
>
> I have the below code
>
>
> val sev = StreamExecutionEnvironment.getExecutionEnvironment
> val socTextStream = sev.socketTextStream("localhost",)
>
> val counts = socTextStream.flatMap{_.split("\\s")}
>   .map { (_, 1) }
>   .keyBy(0)
>
> .window(SlidingProcessingTimeWindows.of(Time.seconds(20),Time.seconds(10)))
>   .trigger(CountTrigger.of(5))
>   .sum(1)
>
> counts.print()
> sev.execute()
>
> I am sending messages to the port  using nc -lk 
> This is my sample input
>
> a
> a
> a
> a
> a
> b
> b
> b
> b
> b
> c
> c
> c
> c
> c
> d
> d
> d
> d
> d
> e
> e
> e
> e
> e
>
> I am sending 5 of each letter since I have a Count Trigger of 5. I was
> expecting that for each 5 character, the code will print 5, i.e., (a,5)
> (b,5) etc. But the output I am getting is little confusing.
> Output:
>
> 1> (a,5)
> 1> (a,5)
> 1> (b,5)
> 2> (c,5)
> 2> (c,5)
> 1> (d,5)
> 1> (e,5)
> 1> (e,5)
>
> As you can see, for some character the count is printed twice(a,c,e) and
> for some characters it is printed only once (b,d). I am not able to figure
> out what is going on. I think it may have something to do with the
> SlidingProcessingTimeWindow but I am not sure.
> Can someone explain me what is going on?
>
>
> Thanks and Regards,
> Vishnu Viswanath
> www.vishnuviswanath.com
>
>
>


Memory ran out PageRank

2016-03-14 Thread Ovidiu-Cristian MARCU
Hi,

While running PageRank on a synthetic graph I ran into this problem: 
Any advice on how I should proceed to overcome this memory issue?

IterationHead(Vertex-centric iteration 
(org.apache.flink.graph.library.PageRank$VertexRankUpdater@7712cae0 | 
org.apache.flink.graph.library.PageRank$RankMesseng$
java.lang.RuntimeException: Memory ran out. Compaction failed. numPartitions: 
32 minPartition: 24 maxPartition: 25 number of overflow segments: 328 
bucketSize: 638 Overall memory: 115539968 Partition memory: 50659328 Message: 
Index: 25, Size: 24
at 
org.apache.flink.runtime.operators.hash.CompactingHashTable.insertRecordIntoPartition(CompactingHashTable.java:469)
at 
org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:414)
at 
org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:325)
at 
org.apache.flink.runtime.iterative.task.IterationHeadTask.readInitialSolutionSet(IterationHeadTask.java:212)
at 
org.apache.flink.runtime.iterative.task.IterationHeadTask.run(IterationHeadTask.java:273)
at 
org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:354)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
at java.lang.Thread.run(Thread.java:745)

Thanks!

Best,
Ovidiu

Re: Application logging on YARN

2016-03-14 Thread Stefano Baghino
Ok, my bad, I was simply looking in the wrong place. I thought the logs were
sent to YARN but they were actually stored in the Flink logs folder.
Problem solved, sorry for the mix up.

On Sun, Mar 13, 2016 at 8:48 PM, Stefano Baghino <
stefano.bagh...@radicalbit.io> wrote:

> There's another open thread on a similar situation but I'm not quite sure
> this is completely related, so I'm opening a new one.
>
> Is there a particular setting that has to be used when logging from an
> application that is running in YARN? I quickly put together a simple
> application that reads from and writes to Kafka, starting from the Maven
> Java archetype. I didn't edit the log4j.properties coming with the
> archetype, so it's currently using a ConsoleAppender, but I'm not sure on
> how I should edit it in order to make it appear in the YARN aggregate logs;
> as of now, yarn logs just show the TaskManager and JobManager logs, with no
> signs of my logs appearing on them. Should I look elsewhere?
>
> In the application I use SLF4J, like this:
>
> private static final Logger logger = LoggerFactory.getLogger(Job.class);
>
> To make sure I'm not just skipping the points where I log, I put error
> level logs at the very first line of my application. Still nothing.
>
> --
> BR,
> Stefano Baghino
>
> Software Engineer @ Radicalbit
>



-- 
BR,
Stefano Baghino

Software Engineer @ Radicalbit


Integration Alluxio and Flink

2016-03-14 Thread Andrea Sella
Hi to all,

I'm trying to integrate Alluxio and Apache Flink; I followed the "Running Flink
on Alluxio" guide to set up Flink.

I tested in local mode executing:

bin/flink run ./examples/batch/WordCount.jar --input
alluxio:///flink/README.txt

But I've run into a TimeoutException; I've attached the logs. It seems the
trouble is due to a Netty dependency conflict.

Thank you,
Andrea


alluxio-integration.log
Description: Binary data


Re: TimeWindow not getting last elements any longer with flink 1.0 vs 0.10.1

2016-03-14 Thread Till Rohrmann
Hi Arnaud,

with version 1.0 the behaviour of window triggering in the case of a finite
stream was slightly changed. If you use event time, then all unfinished
windows are triggered when your stream ends. This can be motivated
by the fact that the end of a stream is equivalent to saying that no more
elements will arrive until the maximum time (infinity) has been reached. This
knowledge allows you to emit a Long.MaxValue watermark when an event time
stream is finished, which will trigger all lingering windows.

In contrast to event time, you cannot say the same about a finished
processing time stream. There we don’t have logical time; instead we use the
actual processing time to reason about windows. When a stream finishes,
we cannot fast-forward the processing time to a point where the
windows will fire. This can only happen if we keep the operators alive
until the wall clock tells us that it’s time to fire the windows. However,
there is no such feature implemented yet in Flink.
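
If you want such a test to still see the last elements, one possible workaround is to switch the test to event time. Below is a minimal, untested sketch (the ascending-timestamp extractor is just a placeholder) of what that could look like:

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

env.fromElements("Element 1", "Element 2", "Element 3")
  // give every element an (ascending) event timestamp; once the finite source
  // is exhausted, a Long.MaxValue watermark is emitted and fires the
  // still-open windows
  .assignAscendingTimestamps(_ => System.currentTimeMillis())
  .map(e => (e, 1))
  .timeWindowAll(Time.seconds(3))
  .sum(1)
  .print()

env.execute()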

I hope this helps you to understand the failing test cases.

Cheers,
Till
​

On Mon, Mar 14, 2016 at 1:14 PM, LINZ, Arnaud 
wrote:

> Hello,
>
>
>
> I’ve switched my Flink version from 0.10.1 to 1.0 and I have a regression
> in some  of my unit tests.
>
>
>
> To narrow the problem, here is what I’ve figured out:
>
>
>
> -  I use a simple Streaming application with a source defined as
> “fromElements("Element 1", "Element 2", "Element 3")
>
> -  I use a simple time window function with a 3 second window :
> timeWindowAll(Time.seconds(3))
>
> -  I use an apply() function and counts the total number of
> elements I get with a global counter
>
>
>
> With the previous version, I got all three elements, not because they are
> triggered under 3 seconds, but because the source ends.
>
> With the 1.0 version, I don’t get any elements, and that’s annoying
> because as the source ends the application ends even if I sleep 5 seconds
> after the execute() method.
>
>
>
> (If I replace fromElement with fromCollection with a 1 element list
> and Time.second(3) with Time.millisecond(1), I get a random number of
> elements)
>
>
>
> Is this behavior wanted? If yes, how do I get my last elements now?
>
>
>
> Best regards,
>
> Arnaud
>
>
>
>
>
>
>
> --
>
> L'intégrité de ce message n'étant pas assurée sur internet, la société
> expéditrice ne peut être tenue responsable de son contenu ni de ses pièces
> jointes. Toute utilisation ou diffusion non autorisée est interdite. Si
> vous n'êtes pas destinataire de ce message, merci de le détruire et
> d'avertir l'expéditeur.
>
> The integrity of this message cannot be guaranteed on the Internet. The
> company that sent this message cannot therefore be held liable for its
> content nor attachments. Any unauthorized use or dissemination is
> prohibited. If you are not the intended recipient of this message, then
> please delete it and notify the sender.
>


Re: Flink and YARN ship folder

2016-03-14 Thread Andrea Sella
Hi Robert,

Ok, thank you.

2016-03-14 11:13 GMT+01:00 Robert Metzger :

> Hi Andrea,
>
> You don't have to manually replicate any operations on the slaves. All
> files in the lib/ folder are transferred to all containers (Jobmanagers and
> TaskManagers).
>
>
> On Sat, Mar 12, 2016 at 3:25 PM, Andrea Sella 
> wrote:
>
>> Hi Ufuk,
>>
>> I'm trying to execute the WordCount batch example with input and output
>> on Alluxio, i followed Running Flink on Alluxio
>>  and
>> added the library to lib folder. Have I to replicate this operation on the
>> slaves or YARN manage that and I must have the library just where I launch
>> the job?
>>
>> Thanks,
>> Andrea
>>
>> 2016-03-11 19:23 GMT+01:00 Ufuk Celebi :
>>
>>> Everything in the lib folder should be added to the classpath. Can you
>>> check the YARN client logs that the files are uploaded? Furthermore,
>>> you can check the classpath of the JVM in the YARN logs of the
>>> JobManager/TaskManager processes.
>>>
>>> – Ufuk
>>>
>>>
>>> On Fri, Mar 11, 2016 at 5:33 PM, Andrea Sella
>>>  wrote:
>>> > Hi,
>>> >
>>> > There is a way to add external dependencies to Flink Job,  running on
>>> YARN,
>>> > not using HADOOP_CLASSPATH?
>>> > I am looking for a similar idea to standalone mode using lib folder.
>>> >
>>> > BR,
>>> > Andrea
>>>
>>
>>
>


RE:Flink job on secure Yarn fails after many hours

2016-03-14 Thread Thomas Lamirault
Hello everyone,



We are facing the same problem now in our Flink applications, launched using
YARN.

Just want to know if there is any update about this exception?



Thanks



Thomas





From: ni...@basj.es [ni...@basj.es] on behalf of Niels Basjes [ni...@basjes.nl]
Sent: Friday, 4 December 2015 10:40
To: user@flink.apache.org
Subject: Re: Flink job on secure Yarn fails after many hours

Hi Maximilian,

I just downloaded the version from your google drive and used that to run my 
test topology that accesses HBase.
I deliberately started it twice to double the chance to run into this situation.

I'll keep you posted.

Niels


On Thu, Dec 3, 2015 at 11:44 AM, Maximilian Michels 
> wrote:
Hi Niels,

Just got back from our CI. The build above would fail with a
Checkstyle error. I corrected that. Also I have built the binaries for
your Hadoop version 2.6.0.

Binaries:

https://github.com/mxm/flink/archive/kerberos-yarn-heartbeat-fail-0.10.1.zip

Thanks,
Max

On Wed, Dec 2, 2015 at 6:52 PM, Maximilian Michels 
<0.0.0.0:41281
 >> >> > 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager
 >> >> > - Actor akka://flink/user/jobmanager#403236912 terminated,
 >> >> > stopping
 >> >> > process...
 >> >> > 21:30:28,286 INFO
 >> >> > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
 >> >> > - Removing web root dir
 >> >> > /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd
 >> >> >
 >> >> >
 >> >> > --
 >> >> > Best regards / Met vriendelijke groeten,
 >> >> >
 >> >> > Niels Basjes
 >> >
 >> >
 >> >
 >> >
 >> > --
 >> > Best regards / Met vriendelijke groeten,
 >> >
 >> > Niels Basjes
 >
 >
 >
 >
 > --
 > Best regards / Met vriendelijke groeten,
 >
 > Niels Basjes
>>>
>>>
>>>
>>>
>>> --
>>> Best regards / Met vriendelijke groeten,
>>>
>>> Niels Basjes



--
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: kafka.javaapi.consumer.SimpleConsumer class not found

2016-03-14 Thread Balaji Rajagopalan
Yep, the same issue as before (class not found) with Flink 0.10.2 with Scala
version 2.11. I was not able to use Scala 2.10 since the flink-connector-kafka
connector for 0.10.2 is not available.

balaji

On Mon, Mar 14, 2016 at 4:20 PM, Balaji Rajagopalan <
balaji.rajagopa...@olacabs.com> wrote:

> Yes figured that out, thanks for point that, my bad. I have put back
> 0.10.2 as flink version, will try to reproduce the problem again, this time
> I have explicitly called out the scala version as 2.11.
>
>
> On Mon, Mar 14, 2016 at 4:14 PM, Robert Metzger 
> wrote:
>
>> Hi,
>>
>> flink-connector-kafka_ doesn't exist for 1.0.0. You have to use either
>> flink-connector-kafka-0.8_ or flink-connector-kafka-0.9_
>>
>>
>> On Mon, Mar 14, 2016 at 11:17 AM, Balaji Rajagopalan <
>> balaji.rajagopa...@olacabs.com> wrote:
>>
>>> What I noticied was that, if I remove the dependency on
>>> flink-connector-kafka so it is clearly to do something with that
>>> dependency.
>>>
>>>
>>> On Mon, Mar 14, 2016 at 3:46 PM, Balaji Rajagopalan <
>>> balaji.rajagopa...@olacabs.com> wrote:
>>>
 Robert,
I have  moved on to latest version of flink of 1.0.0 hoping that
 will solve my problem with kafka connector . Here is my pom.xml but now I
 cannot get the code compiled.

 [ERROR] Failed to execute goal
 net.alchim31.maven:scala-maven-plugin:3.2.1:compile (scala-compile-first)
 on project flink-streaming-demo: Execution scala-compile-first of goal
 net.alchim31.maven:scala-maven-plugin:3.2.1:compile failed: For artifact
 {null:null:null:jar}: The groupId cannot be empty. -> [Help 1]

 I read about the above errors in most cases people where able to
 overcome is by deleting the .m2 directory, and that did not fix the issue
 for me.

 What I noticied was that, if I remove the dependency on

 Here is my pom.xml

 
 
 http://maven.apache.org/POM/4.0.0; 
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance; 
 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
 http://maven.apache.org/xsd/maven-4.0.0.xsd;>
4.0.0

com.dataArtisans
flink-streaming-demo
0.1
jar

Flink Streaming Demo
http://www.data-artisans.com


   UTF-8
   1.7.12
   1.0.0
   2.10






   
  org.apache.flink
  flink-streaming-scala_${scala.version}
  ${flink.version}
   

   
  org.apache.flink
  flink-runtime-web_${scala.version}
  ${flink.version}
   

   
  org.elasticsearch
  elasticsearch
  1.7.3
  compile
   

   
  joda-time
  joda-time
  2.7
   

   
  org.apache.kafka
  kafka_${scala.version}
  0.8.2.0
   

 
   org.apache.flink
   flink-connector-kafka_${scala.version}
   ${flink.version}
   

   
 org.json4s
 json4s-native_${scala.version}
 3.3.0
   





   

  
  
 net.alchim31.maven
 scala-maven-plugin
 3.2.1
 


   scala-compile-first
   process-resources
   
  compile
   




   scala-test-compile
   process-test-resources
   
  testCompile
   

 
 

   -Xms128m
   -Xmx512m

 
  

  
 org.apache.maven.plugins
 maven-dependency-plugin
 2.9
 

   unpack
   
   prepare-package
   
  unpack
   
   
  
 
 
org.apache.flink

 flink-connector-kafka_${scala.version}
1.0.0
jar
false

 ${project.build.directory}/classes
org/apache/flink/**
 
   

Re: kafka.javaapi.consumer.SimpleConsumer class not found

2016-03-14 Thread Balaji Rajagopalan
Yes, figured that out, thanks for pointing it out, my bad. I have put back
0.10.2 as the Flink version and will try to reproduce the problem again; this
time I have explicitly called out the Scala version as 2.11.


On Mon, Mar 14, 2016 at 4:14 PM, Robert Metzger  wrote:

> Hi,
>
> flink-connector-kafka_ doesn't exist for 1.0.0. You have to use either
> flink-connector-kafka-0.8_ or flink-connector-kafka-0.9_
>
>
> On Mon, Mar 14, 2016 at 11:17 AM, Balaji Rajagopalan <
> balaji.rajagopa...@olacabs.com> wrote:
>
>> What I noticied was that, if I remove the dependency on
>> flink-connector-kafka so it is clearly to do something with that
>> dependency.
>>
>>
>> On Mon, Mar 14, 2016 at 3:46 PM, Balaji Rajagopalan <
>> balaji.rajagopa...@olacabs.com> wrote:
>>
>>> Robert,
>>>I have  moved on to latest version of flink of 1.0.0 hoping that will
>>> solve my problem with kafka connector . Here is my pom.xml but now I cannot
>>> get the code compiled.
>>>
>>> [ERROR] Failed to execute goal
>>> net.alchim31.maven:scala-maven-plugin:3.2.1:compile (scala-compile-first)
>>> on project flink-streaming-demo: Execution scala-compile-first of goal
>>> net.alchim31.maven:scala-maven-plugin:3.2.1:compile failed: For artifact
>>> {null:null:null:jar}: The groupId cannot be empty. -> [Help 1]
>>>
>>> I read about the above errors in most cases people where able to
>>> overcome is by deleting the .m2 directory, and that did not fix the issue
>>> for me.
>>>
>>> What I noticied was that, if I remove the dependency on
>>>
>>> Here is my pom.xml
>>>
>>> 
>>> 
>>> http://maven.apache.org/POM/4.0.0; 
>>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance; 
>>> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
>>> http://maven.apache.org/xsd/maven-4.0.0.xsd;>
>>>4.0.0
>>>
>>>com.dataArtisans
>>>flink-streaming-demo
>>>0.1
>>>jar
>>>
>>>Flink Streaming Demo
>>>http://www.data-artisans.com
>>>
>>>
>>>   UTF-8
>>>   1.7.12
>>>   1.0.0
>>>   2.10
>>>
>>>
>>>
>>>
>>>
>>>
>>>   
>>>  org.apache.flink
>>>  flink-streaming-scala_${scala.version}
>>>  ${flink.version}
>>>   
>>>
>>>   
>>>  org.apache.flink
>>>  flink-runtime-web_${scala.version}
>>>  ${flink.version}
>>>   
>>>
>>>   
>>>  org.elasticsearch
>>>  elasticsearch
>>>  1.7.3
>>>  compile
>>>   
>>>
>>>   
>>>  joda-time
>>>  joda-time
>>>  2.7
>>>   
>>>
>>>   
>>>  org.apache.kafka
>>>  kafka_${scala.version}
>>>  0.8.2.0
>>>   
>>>
>>> 
>>>   org.apache.flink
>>>   flink-connector-kafka_${scala.version}
>>>   ${flink.version}
>>>   
>>>
>>>   
>>> org.json4s
>>> json4s-native_${scala.version}
>>> 3.3.0
>>>   
>>>
>>>
>>>
>>>
>>>
>>>   
>>>
>>>  
>>>  
>>> net.alchim31.maven
>>> scala-maven-plugin
>>> 3.2.1
>>> 
>>>
>>>
>>>   scala-compile-first
>>>   process-resources
>>>   
>>>  compile
>>>   
>>>
>>>
>>>
>>>
>>>   scala-test-compile
>>>   process-test-resources
>>>   
>>>  testCompile
>>>   
>>>
>>> 
>>> 
>>>
>>>   -Xms128m
>>>   -Xmx512m
>>>
>>> 
>>>  
>>>
>>>  
>>> org.apache.maven.plugins
>>> maven-dependency-plugin
>>> 2.9
>>> 
>>>
>>>   unpack
>>>   
>>>   prepare-package
>>>   
>>>  unpack
>>>   
>>>   
>>>  
>>> 
>>> 
>>>org.apache.flink
>>>
>>> flink-connector-kafka_${scala.version}
>>>1.0.0
>>>jar
>>>false
>>>
>>> ${project.build.directory}/classes
>>>org/apache/flink/**
>>> 
>>> 
>>> 
>>>org.apache.kafka
>>>kafka_${scala.version}
>>>0.8.2.0
>>>jar
>>>false
>>>
>>> ${project.build.directory}/classes
>>>kafka/**
>>> 
>>>  
>>>   
>>>
>>>

Re: Kafka integration error

2016-03-14 Thread Robert Metzger
Hi Stefanos,

this looks like an issue with Kafka. Which version does your broker have?
Can you check the logs of the broker you are trying to connect to?

On Fri, Mar 11, 2016 at 5:27 PM, Stefanos Antaris <
antaris.stefa...@gmail.com> wrote:

> Hi to all,
>
> I am trying to make Flink work with Kafka but I always get the
> following exception. It works perfectly on my laptop but when I try to use it
> on the cluster it always fails.
>
> java.lang.Exception
>   at 
> org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher.run(LegacyFetcher.java:222)
>   at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08.run(FlinkKafkaConsumer08.java:316)
>   at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:78)
>   at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:56)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:224)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.nio.channels.ClosedChannelException
>   at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
>   at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:78)
>   at 
> kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:68)
>   at 
> kafka.consumer.SimpleConsumer.getOffsetsBefore(SimpleConsumer.scala:127)
>   at 
> kafka.javaapi.consumer.SimpleConsumer.getOffsetsBefore(SimpleConsumer.scala:79)
>   at 
> org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher$SimpleConsumerThread.getLastOffset(LegacyFetcher.java:759)
>   at 
> org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher$SimpleConsumerThread.getMissingOffsetsFromKafka(LegacyFetcher.java:712)
>   at 
> org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher$SimpleConsumerThread.run(LegacyFetcher.java:462)
>
>
>
> Here is my pom.xml if it helps with the error
>
> 
> 
> org.apache.flink
> flink-core
> 1.0.0
> 
>
> 
> org.apache.flink
> flink-clients_2.10
> 1.0.0
> 
>
>
> 
> org.apache.flink
> flink-streaming-java_2.10
> 1.0.0
> 
>
>
> 
> org.apache.flink
> flink-connector-kafka-0.8_2.10
> 1.0.0
> 
>
>
> 
>
> Best regards,
> Stefanos
>


Re: Checkpoint

2016-03-14 Thread Aljoscha Krettek
Hi,
I’m not aware of a problem where pending files are not moved to their final 
locations. So if you have such a behavior it would indicate a bug. Also, the 
"trigger checkpoint” does not yet indicate that the checkpoint is happening. If 
you have a very long sleep interval in some of your operations, this will also
block checkpointing from happening. A checkpoint at an operation can only be 
performed when control is currently not inside a user function. We do this to 
ensure consistency.
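
To illustrate, here is a minimal sketch (a stand-in job, not yours): while the sleep below is running, control is inside the user function, so that subtask cannot take part in a checkpoint and the barrier is only processed after map() returns:

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

env.fromElements(1, 2, 3)
  .map { value =>
    // control stays inside the user function for the whole sleep, so the
    // checkpoint (and anything that depends on its completion, like the
    // RollingSink moving files out of "pending") is delayed until we return
    Thread.sleep(3000)
    value
  }
  .print()

env.execute()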

If you’d like to help you can also increase the log level to DEBUG, then the 
RollingSink will print very detailed information about what it does, for 
example moving files to/from “pending”. If you can reproduce the problem with 
DEBUG logs this would help me finding if there is a problem with Flink.

Regards,
Aljoscha
> On 10 Mar 2016, at 19:44, Vijay Srinivasaraghavan  
> wrote:
> 
> Thanks Ufuk and Stephan.
> 
> I have added Identity mapper and disabled chaining. With that, I am able to 
> see the backpressure alert on the identity mapper task.
> 
> I have noticed one thing that when I introduced delay (sleep) on the 
> subsequent task, sometimes checkpoint is not working. I could see checkpoint 
> trigger but the files are not moved from "pending" state. I will try to 
> reproduce to find the pattern but are you aware of any such scenario?
> 
> Regards
> Vijay
> 
> On Thursday, March 10, 2016 2:51 AM, Stephan Ewen  wrote:
> 
> 
> Just to be sure: Is the task whose backpressure you want to monitor the Kafka 
> Source?
> 
> There is an open issue that backpressure monitoring does not work for the 
> Kafka Source: https://issues.apache.org/jira/browse/FLINK-3456
> 
> To circumvent that, add an "IdentityMapper" after the Kafka source, make sure 
> it is non-chained, and monitor the backpressure on that MapFunction.
> 
> Greetings,
> Stephan
> 
> 
> On Thu, Mar 10, 2016 at 11:23 AM, Robert Metzger  wrote:
> Hi Vijay,
> 
> regarding your other questions:
> 
> 1) On the TaskManagers, the FlinkKafkaConsumers will write the partitions 
> they are going to read in the log. There is currently no way of seeing the 
> state of a checkpoint in Flink (which is the offsets).
> However, once a checkpoint is completed, the Kafka consumer is committing the 
> offset to the Kafka broker. (I could not find tool to get the committed 
> offsets from the broker, but its either stored in ZK or in a special topic by 
> the broker. In Kafka 0.8 that's easily doable with the 
> kafka.tools.ConsumerOffsetChecker)
> 
> 2) Do you see duplicate data written by the rolling file sink? Or do you see 
> it somewhere else?
> HDP 2.4 is using Hadoop 2.7.1 so the truncate() of invalid data should 
> actually work properly.
> 
> 
> 
> 
> 
> On Thu, Mar 10, 2016 at 10:44 AM, Ufuk Celebi  wrote:
> How many vertices does the web interface show and what parallelism are
> you running? If the sleeping operator is chained you will not see
> anything.
> 
> If your goal is to just see some back pressure warning, you can call
> env.disableOperatorChaining() and re-run the program. Does this work?
> 
> – Ufuk
> 
> 
> On Thu, Mar 10, 2016 at 1:35 AM, Vijay Srinivasaraghavan
>  wrote:
> > Hi Ufuk,
> >
> > I have increased the sampling size to 1000 and decreased the refresh
> > interval by half. In my Kafka topic I have pumped million messages which is
> > read by KafkaConsumer pipeline and then pass it to a transformation step
> > where I have introduced sleep (3 sec) for every single message received and
> > the final step is HDFS sink using RollingSinc API.
> >
> > jobmanager.web.backpressure.num-samples: 1000
> > jobmanager.web.backpressure.refresh-interval: 3
> >
> >
> > I was hoping to see the backpressure tab from UI to display some warning but
> > I still see "OK" message.
> >
> > This makes me wonder if I am testing the backpressure scenario properly or
> > not?
> >
> > Regards
> > Vijay
> >
> > On Monday, March 7, 2016 3:19 PM, Ufuk Celebi  wrote:
> >
> >
> > Hey Vijay!
> >
> > On Mon, Mar 7, 2016 at 8:42 PM, Vijay Srinivasaraghavan
> >  wrote:
> >> 3) How can I simulate and verify backpressure? I have introduced some
> >> delay
> >> (Thread Sleep) in the job before the sink but the "backpressure" tab from
> >> UI
> >> does not show any indication of whether backpressure is working or not.
> >
> > If a task is slow, it is back pressuring upstream tasks, e.g. if your
> > transformations have the sleep, the sources should be back pressured.
> > It can happen that even with the sleep the tasks still produce their
> > data as fast as they can and hence no back pressure is indicated in
> > the web interface. You can increase the sleep to check this.
> >
> > The mechanism used to determine back pressure is based on sampling the
> > stack traces of running tasks. You can increase the number of samples
> > and/or decrease the 

Re: kafka.javaapi.consumer.SimpleConsumer class not found

2016-03-14 Thread Robert Metzger
Hi,

flink-connector-kafka_ doesn't exist for 1.0.0. You have to use either
flink-connector-kafka-0.8_ or flink-connector-kafka-0.9_
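
For example (assuming Scala 2.10; use the _2.11 suffix if you build against Scala 2.11), the 0.8 connector dependency would look roughly like this:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka-0.8_2.10</artifactId>
    <version>1.0.0</version>
</dependency>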


On Mon, Mar 14, 2016 at 11:17 AM, Balaji Rajagopalan <
balaji.rajagopa...@olacabs.com> wrote:

> What I noticied was that, if I remove the dependency on
> flink-connector-kafka so it is clearly to do something with that
> dependency.
>
>
> On Mon, Mar 14, 2016 at 3:46 PM, Balaji Rajagopalan <
> balaji.rajagopa...@olacabs.com> wrote:
>
>> Robert,
>>I have  moved on to latest version of flink of 1.0.0 hoping that will
>> solve my problem with kafka connector . Here is my pom.xml but now I cannot
>> get the code compiled.
>>
>> [ERROR] Failed to execute goal
>> net.alchim31.maven:scala-maven-plugin:3.2.1:compile (scala-compile-first)
>> on project flink-streaming-demo: Execution scala-compile-first of goal
>> net.alchim31.maven:scala-maven-plugin:3.2.1:compile failed: For artifact
>> {null:null:null:jar}: The groupId cannot be empty. -> [Help 1]
>>
>> I read about the above errors in most cases people where able to overcome
>> is by deleting the .m2 directory, and that did not fix the issue for me.
>>
>> What I noticied was that, if I remove the dependency on
>>
>> Here is my pom.xml
>>
>> 
>> 
>> http://maven.apache.org/POM/4.0.0; 
>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance; 
>> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
>> http://maven.apache.org/xsd/maven-4.0.0.xsd;>
>>4.0.0
>>
>>com.dataArtisans
>>flink-streaming-demo
>>0.1
>>jar
>>
>>Flink Streaming Demo
>>http://www.data-artisans.com
>>
>>
>>   UTF-8
>>   1.7.12
>>   1.0.0
>>   2.10
>>
>>
>>
>>
>>
>>
>>   
>>  org.apache.flink
>>  flink-streaming-scala_${scala.version}
>>  ${flink.version}
>>   
>>
>>   
>>  org.apache.flink
>>  flink-runtime-web_${scala.version}
>>  ${flink.version}
>>   
>>
>>   
>>  org.elasticsearch
>>  elasticsearch
>>  1.7.3
>>  compile
>>   
>>
>>   
>>  joda-time
>>  joda-time
>>  2.7
>>   
>>
>>   
>>  org.apache.kafka
>>  kafka_${scala.version}
>>  0.8.2.0
>>   
>>
>> 
>>   org.apache.flink
>>   flink-connector-kafka_${scala.version}
>>   ${flink.version}
>>   
>>
>>   
>> org.json4s
>> json4s-native_${scala.version}
>> 3.3.0
>>   
>>
>>
>>
>>
>>
>>   
>>
>>  
>>  
>> net.alchim31.maven
>> scala-maven-plugin
>> 3.2.1
>> 
>>
>>
>>   scala-compile-first
>>   process-resources
>>   
>>  compile
>>   
>>
>>
>>
>>
>>   scala-test-compile
>>   process-test-resources
>>   
>>  testCompile
>>   
>>
>> 
>> 
>>
>>   -Xms128m
>>   -Xmx512m
>>
>> 
>>  
>>
>>  
>> org.apache.maven.plugins
>> maven-dependency-plugin
>> 2.9
>> 
>>
>>   unpack
>>   
>>   prepare-package
>>   
>>  unpack
>>   
>>   
>>  
>> 
>> 
>>org.apache.flink
>>
>> flink-connector-kafka_${scala.version}
>>1.0.0
>>jar
>>false
>>
>> ${project.build.directory}/classes
>>org/apache/flink/**
>> 
>> 
>> 
>>org.apache.kafka
>>kafka_${scala.version}
>>0.8.2.0
>>jar
>>false
>>
>> ${project.build.directory}/classes
>>kafka/**
>> 
>>  
>>   
>>
>> 
>>  
>>
>>  
>>
>>  
>> org.apache.maven.plugins
>> maven-compiler-plugin
>> 3.1
>> 
>>1.8 
>>1.8 
>> 
>>  
>>
>>  
>> org.apache.rat
>> apache-rat-plugin
>> 0.10
>> false
>> 
>>
>>   verify
>>   
>>  

Re: Behavior of SlidingProessingTimeWindow with CountTrigger

2016-03-14 Thread Aljoscha Krettek
Hi,
I created a visualization to help explain the situation: 
http://s21.postimg.org/dofhcw52f/window_example.png

The SlidingProcessingTimeWindows assigner assigns elements to windows based on 
the current processing time. The CountTrigger only fires if a window contains 5 
elements (or more). In your test the windows for a, c and e fell into case b 
because you probably entered the letters very fast. For elements  b and d we 
have case a The elements were far enough apart or you happened to enter them 
right on a window boundary such that only one window contains all of them. The 
other windows don’t contain enough elements to reach 5. In my drawing window 1 
contains 5 elements while window 2 only contains 3 of those elements.
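
To make this concrete, here is a small sketch of the assignment logic (a simplified re-implementation for illustration, not the actual Flink class) for SlidingProcessingTimeWindows.of(Time.seconds(20), Time.seconds(10)); every element lands in windowSize / slide = 2 overlapping windows, and the CountTrigger only fires for a window that ends up holding 5 elements of the same key:

// [start, end) ranges of the sliding windows an element with the given
// processing timestamp is assigned to
def assignedWindows(timestamp: Long,
                    size: Long = 20 * 1000L,
                    slide: Long = 10 * 1000L): Seq[(Long, Long)] = {
  val lastStart = timestamp - (timestamp % slide)
  (lastStart until (timestamp - size) by -slide).map(start => (start, start + size))
}

// e.g. an element processed at t = 25s belongs to [20s, 40s) and [10s, 30s)
println(assignedWindows(25 * 1000L))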

I hope this helps.

Cheers,
Aljoscha
> On 12 Mar 2016, at 19:19, Vishnu Viswanath  
> wrote:
> 
> Hi All,
> 
> 
> I have the below code
> 
> 
> val sev = StreamExecutionEnvironment.getExecutionEnvironment
> val socTextStream = sev.socketTextStream("localhost",)
> 
> val counts = socTextStream.flatMap{_.split("\\s")}
>   .map { (_, 1) }
>   .keyBy(0)
>   .window(SlidingProcessingTimeWindows.of(Time.seconds(20),Time.seconds(10)))
>   .trigger(CountTrigger.of(5))
>   .sum(1)
> 
> counts.print()
> sev.execute()
> 
> I am sending messages to the port  using nc -lk 
> This is my sample input
> 
> a
> a
> a
> a
> a
> b
> b
> b
> b
> b
> c
> c
> c
> c
> c
> d
> d
> d
> d
> d
> e
> e
> e
> e
> e
> 
> I am sending 5 of each letter since I have a Count Trigger of 5. I was 
> expecting that for each 5 character, the code will print 5, i.e., (a,5) (b,5) 
> etc. But the output I am getting is little confusing.
> Output:
> 
> 1> (a,5)
> 1> (a,5)
> 1> (b,5)
> 2> (c,5)
> 2> (c,5)
> 1> (d,5)
> 1> (e,5)
> 1> (e,5)
> 
> As you can see, for some character the count is printed twice(a,c,e) and for 
> some characters it is printed only once (b,d). I am not able to figure out 
> what is going on. I think it may have something to do with the 
> SlidingProcessingTimeWindow but I am not sure.
> Can someone explain me what is going on?
> 
> 
> Thanks and Regards,
> Vishnu Viswanath
> www.vishnuviswanath.com
> 



Re: kafka.javaapi.consumer.SimpleConsumer class not found

2016-03-14 Thread Balaji Rajagopalan
What I noticed was that the problem goes away if I remove the dependency on
flink-connector-kafka, so it clearly has something to do with that
dependency.


On Mon, Mar 14, 2016 at 3:46 PM, Balaji Rajagopalan <
balaji.rajagopa...@olacabs.com> wrote:

> Robert,
>I have  moved on to latest version of flink of 1.0.0 hoping that will
> solve my problem with kafka connector . Here is my pom.xml but now I cannot
> get the code compiled.
>
> [ERROR] Failed to execute goal
> net.alchim31.maven:scala-maven-plugin:3.2.1:compile (scala-compile-first)
> on project flink-streaming-demo: Execution scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.1:compile failed: For artifact
> {null:null:null:jar}: The groupId cannot be empty. -> [Help 1]
>
> I read about the above errors in most cases people where able to overcome
> is by deleting the .m2 directory, and that did not fix the issue for me.
>
> What I noticied was that, if I remove the dependency on
>
> Here is my pom.xml
>
> 
> 
> http://maven.apache.org/POM/4.0.0; 
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance; 
> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
> http://maven.apache.org/xsd/maven-4.0.0.xsd;>
>4.0.0
>
>com.dataArtisans
>flink-streaming-demo
>0.1
>jar
>
>Flink Streaming Demo
>http://www.data-artisans.com
>
>
>   UTF-8
>   1.7.12
>   1.0.0
>   2.10
>
>
>
>
>
>
>   
>  org.apache.flink
>  flink-streaming-scala_${scala.version}
>  ${flink.version}
>   
>
>   
>  org.apache.flink
>  flink-runtime-web_${scala.version}
>  ${flink.version}
>   
>
>   
>  org.elasticsearch
>  elasticsearch
>  1.7.3
>  compile
>   
>
>   
>  joda-time
>  joda-time
>  2.7
>   
>
>   
>  org.apache.kafka
>  kafka_${scala.version}
>  0.8.2.0
>   
>
> 
>   org.apache.flink
>   flink-connector-kafka_${scala.version}
>   ${flink.version}
>   
>
>   
> org.json4s
> json4s-native_${scala.version}
> 3.3.0
>   
>
>
>
>
>
>   
>
>  
>  
> net.alchim31.maven
> scala-maven-plugin
> 3.2.1
> 
>
>
>   scala-compile-first
>   process-resources
>   
>  compile
>   
>
>
>
>
>   scala-test-compile
>   process-test-resources
>   
>  testCompile
>   
>
> 
> 
>
>   -Xms128m
>   -Xmx512m
>
> 
>  
>
>  
> org.apache.maven.plugins
> maven-dependency-plugin
> 2.9
> 
>
>   unpack
>   
>   prepare-package
>   
>  unpack
>   
>   
>  
> 
> 
>org.apache.flink
>
> flink-connector-kafka_${scala.version}
>1.0.0
>jar
>false
>
> ${project.build.directory}/classes
>org/apache/flink/**
> 
> 
> 
>org.apache.kafka
>kafka_${scala.version}
>0.8.2.0
>jar
>false
>
> ${project.build.directory}/classes
>kafka/**
> 
>  
>   
>
> 
>  
>
>  
>
>  
> org.apache.maven.plugins
> maven-compiler-plugin
> 3.1
> 
>1.8 
>1.8 
> 
>  
>
>  
> org.apache.rat
> apache-rat-plugin
> 0.10
> false
> 
>
>   verify
>   
>  check
>   
>
> 
> 
>false
>0
>
>   
>implementation="org.apache.rat.analysis.license.SimplePatternBasedLicense">
>  AL2 
>  Apache License 2.0
>  
>  
> Copyright 2015 data Artisans GmbH
> 

Re: kafka.javaapi.consumer.SimpleConsumer class not found

2016-03-14 Thread Balaji Rajagopalan
Robert,
   I have moved on to the latest version of Flink, 1.0.0, hoping that it will
solve my problem with the Kafka connector. Here is my pom.xml, but now I cannot
get the code to compile.

[ERROR] Failed to execute goal
net.alchim31.maven:scala-maven-plugin:3.2.1:compile (scala-compile-first)
on project flink-streaming-demo: Execution scala-compile-first of goal
net.alchim31.maven:scala-maven-plugin:3.2.1:compile failed: For artifact
{null:null:null:jar}: The groupId cannot be empty. -> [Help 1]

I read that in most cases people were able to overcome the above error by
deleting the .m2 directory, but that did not fix the issue for me.

What I noticied was that, if I remove the dependency on

Here is my pom.xml



http://maven.apache.org/POM/4.0.0;
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance;
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd;>
   4.0.0

   com.dataArtisans
   flink-streaming-demo
   0.1
   jar

   Flink Streaming Demo
   http://www.data-artisans.com

   
  UTF-8
  1.7.12
  1.0.0
  2.10
   

   



  
 org.apache.flink
 flink-streaming-scala_${scala.version}
 ${flink.version}
  

  
 org.apache.flink
 flink-runtime-web_${scala.version}
 ${flink.version}
  

  
 org.elasticsearch
 elasticsearch
 1.7.3
 compile
  

  
 joda-time
 joda-time
 2.7
  

  
 org.apache.kafka
 kafka_${scala.version}
 0.8.2.0
  


  org.apache.flink
  flink-connector-kafka_${scala.version}
  ${flink.version}
  

  
org.json4s
json4s-native_${scala.version}
3.3.0
  


   

   
  

 
 
net.alchim31.maven
scala-maven-plugin
3.2.1

   
   
  scala-compile-first
  process-resources
  
 compile
  
   

   
   
  scala-test-compile
  process-test-resources
  
 testCompile
  
   


   
  -Xms128m
  -Xmx512m
   

 

 
org.apache.maven.plugins
maven-dependency-plugin
2.9

   
  unpack
  
  prepare-package
  
 unpack
  
  
 


   org.apache.flink

flink-connector-kafka_${scala.version}
   1.0.0
   jar
   false

${project.build.directory}/classes
   org/apache/flink/**



   org.apache.kafka
   kafka_${scala.version}
   0.8.2.0
   jar
   false

${project.build.directory}/classes
   kafka/**

 
  
   

 

 

 
org.apache.maven.plugins
maven-compiler-plugin
3.1

   1.8 
   1.8 

 

 
org.apache.rat
apache-rat-plugin
0.10
false

   
  verify
  
 check
  
   


   false
   0
   
  
  
 AL2 
 Apache License 2.0
 
 
Copyright 2015 data Artisans GmbH
Licensed under the Apache License,
Version 2.0 (the "License");
 
  
   
   
  
 Apache License 2.0
  
   
   
  
  **/.*
  **/*.prefs
  **/*.properties
  **/*.log
  *.txt/**
  
  **/README.md
  CHANGELOG
  
  **/*.iml
  
  **/target/**
  **/build/**
   

 

 
org.apache.maven.plugins
maven-checkstyle-plugin
2.12.1
 

Re: Log4j configuration on YARN

2016-03-14 Thread Robert Metzger
Hi Nick,

the name of the "log4j-yarn-session.properties" file might be a bit
misleading. The file is just used for the YARN session client, running
locally.
The Job- and TaskManager are going to use the log4j.properties on the
cluster.

On Fri, Mar 11, 2016 at 7:20 PM, Ufuk Celebi  wrote:

> Hey Nick!
>
> I just checked and the conf/log4j.properties file is copied and is
> given as an argument to the JVM.
>
> You should see the following:
> - client logs that the conf/log4j.properties file is copied
> - JobManager logs show log4j.configuration being passed to the JVM.
>
> Can you confirm that these shows up? If yes, but you still don't get
> the expected logging, I would check via -Dlog4j.debug what is
> configured (prints to stdout I think). Does this help?
>
> – Ufuk
>
>
> On Fri, Mar 11, 2016 at 6:02 PM, Nick Dimiduk  wrote:
> > Can anyone tell me where I must place my application-specific
> > log4j.properties to have them honored when running on a YARN cluster? In
> my
> > application jar doesn't work. In the log4j files under flink/conf doesn't
> > work.
> >
> > My goal is to set the log level for 'com.mycompany' classes used in my
> flink
> > application to DEBUG.
> >
> > Thanks,
> > Nick
> >
>


Re: Flink and YARN ship folder

2016-03-14 Thread Robert Metzger
Hi Andrea,

You don't have to manually replicate any operations on the slaves. All
files in the lib/ folder are transferred to all containers (Jobmanagers and
TaskManagers).
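
For example (the jar name and path are just placeholders), something along these lines should be enough, with no manual copying to the worker nodes:

cp alluxio-client-with-dependencies.jar /path/to/flink/lib/
./bin/flink run -m yarn-cluster -yn 2 ./examples/batch/WordCount.jar \
    --input alluxio:///flink/README.txt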


On Sat, Mar 12, 2016 at 3:25 PM, Andrea Sella 
wrote:

> Hi Ufuk,
>
> I'm trying to execute the WordCount batch example with input and output on
> Alluxio; I followed the "Running Flink on Alluxio" guide and
> added the library to the lib folder. Do I have to replicate this operation on
> the slaves, or does YARN manage that so that I only need the library where I
> launch the job?
>
> Thanks,
> Andrea
>
> 2016-03-11 19:23 GMT+01:00 Ufuk Celebi :
>
>> Everything in the lib folder should be added to the classpath. Can you
>> check the YARN client logs that the files are uploaded? Furthermore,
>> you can check the classpath of the JVM in the YARN logs of the
>> JobManager/TaskManager processes.
>>
>> – Ufuk
>>
>>
>> On Fri, Mar 11, 2016 at 5:33 PM, Andrea Sella
>>  wrote:
>> > Hi,
>> >
>> > There is a way to add external dependencies to Flink Job,  running on
>> YARN,
>> > not using HADOOP_CLASSPATH?
>> > I am looking for a similar idea to standalone mode using lib folder.
>> >
>> > BR,
>> > Andrea
>>
>
>


Using a POJO class wrapping an ArrayList

2016-03-14 Thread Mengqi Yang
Hi all,

Now I am building a POJO class for key selectors. Here is the class:

import java.io.Serializable;
import java.util.ArrayList;

public class Ids implements Comparable<Ids>, Serializable {

private static final long serialVersionUID = 1L;

private ArrayList ids = new ArrayList();

Ids() {}

Ids(ArrayList inputIds) {
this.ids = inputIds;
}

}

And the question is, how could I use each element in the array list as a
field for further key selection? I saw that Flink's type info counts the
ArrayList field as just one field. Or maybe I didn't understand it correctly.

Thanks,
Mengqi 



--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Using-a-POJO-class-wrapping-an-ArrayList-tp5483.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.


Re: kafka.javaapi.consumer.SimpleConsumer class not found

2016-03-14 Thread Robert Metzger
Can you send me the full build file to further investigate the issue?

On Fri, Mar 11, 2016 at 4:56 PM, Balaji Rajagopalan <
balaji.rajagopa...@olacabs.com> wrote:

> Robert,
>   That did not fix it ( using flink and connector same version) . Tried
> with scala version 2.11, so will try to see scala 2.10 makes any
> difference.
>
> balaji
>
> On Fri, Mar 11, 2016 at 8:06 PM, Robert Metzger 
> wrote:
>
>> Hi,
>>
>> you have to use the same version for all dependencies from the
>> "org.apache.flink" group.
>>
>> You said these are the versions you are using:
>>
>> flink.version = 0.10.2
>> kafka.verison = 0.8.2
>> flink.kafka.connection.verion=0.9.1
>>
>> For the connector, you also need to use 0.10.2.
>>
>>
>>
>> On Fri, Mar 11, 2016 at 9:56 AM, Balaji Rajagopalan <
>> balaji.rajagopa...@olacabs.com> wrote:
>>
>>> I am tyring to use the flink kafka connector, for this I have specified
>>> the kafka connector dependency and created a fat jar since default flink
>>> installation does not contain kafka connector jars. I have made sure that
>>> flink-streaming-demo-0.1.jar has the
>>> kafka.javaapi.consumer.SimpleConsumer.class but still I see the class not
>>> found exception.
>>>
>>> The code for kafka connector in flink.
>>>
>>> val env = StreamExecutionEnvironment.getExecutionEnvironment
>>> val prop:Properties = new Properties()
>>> prop.setProperty("zookeeper.connect","somezookeer:2181")
>>> prop.setProperty("group.id","some-group")
>>> prop.setProperty("bootstrap.servers","somebroker:9092")
>>>
>>> val stream = env
>>>   .addSource(new FlinkKafkaConsumer082[String]("location", new 
>>> SimpleStringSchema, prop))
>>>
>>> jar tvf flink-streaming-demo-0.1.jar | grep
>>> kafka.javaapi.consumer.SimpleConsumer
>>>
>>>   5111 Fri Mar 11 14:18:36 UTC 2016
>>> *kafka/javaapi/consumer/SimpleConsumer*.class
>>>
>>> flink.version = 0.10.2
>>> kafka.verison = 0.8.2
>>> flink.kafka.connection.verion=0.9.1
>>>
>>> The command that I use to run the flink program in yarn cluster is
>>> below,
>>>
>>> HADOOP_CONF_DIR=/etc/hadoop/conf /usr/share/flink/bin/flink run -c
>>> com.dataartisans.flink_demo.examples.DriverEventConsumer  -m yarn-cluster
>>> -yn 2 /home/balajirajagopalan/flink-streaming-demo-0.1.jar
>>>
>>> java.lang.NoClassDefFoundError: kafka/javaapi/consumer/SimpleConsumer
>>>
>>> at
>>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer.getPartitionsForTopic(FlinkKafkaConsumer.java:691)
>>>
>>> at
>>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer.(FlinkKafkaConsumer.java:281)
>>>
>>> at
>>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer082.(FlinkKafkaConsumer082.java:49)
>>>
>>> at
>>> com.dataartisans.flink_demo.examples.DriverEventConsumer$.main(DriverEventConsumer.scala:53)
>>>
>>> at
>>> com.dataartisans.flink_demo.examples.DriverEventConsumer.main(DriverEventConsumer.scala)
>>>
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>
>>> at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>
>>> at java.lang.reflect.Method.invoke(Method.java:497)
>>>
>>> at
>>> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:497)
>>>
>>> at
>>> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:395)
>>>
>>> at org.apache.flink.client.program.Client.runBlocking(Client.java:252)
>>>
>>> at
>>> org.apache.flink.client.CliFrontend.executeProgramBlocking(CliFrontend.java:676)
>>>
>>> at org.apache.flink.client.CliFrontend.run(CliFrontend.java:326)
>>>
>>> at
>>> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:978)
>>>
>>> at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1028)
>>>
>>> Caused by: java.lang.ClassNotFoundException:
>>> kafka.javaapi.consumer.SimpleConsumer
>>>
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>
>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>>
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>
>>> ... 16 more
>>>
>>>
>>> Any help appreciated.
>>>
>>>
>>> balaji
>>>
>>
>>
>


Re: Passing two value to the ConvergenceCriterion function

2016-03-14 Thread Riccardo Diomedi
Ok!
On 14 Mar 2016, at 10:41, Robert Metzger  wrote:

> Hi,
> 
> take a look at the "Record" class. That one implements the Value interface 
> and can have multiple values.
> 
> On Fri, Mar 11, 2016 at 6:01 PM, Riccardo Diomedi 
>  wrote:
> Hi
> 
> I want to send two value to the ConvergenceCriterion function, so i decided 
> to use an aggregator of Tuple2. But then, when i implement 
> Aggregator, i cannot do that because Tuple2 doesn’t implement Value.
> So i tried to create a class Tuple2Value that implements Value, but here i 
> get stuck because i don’t know how to do it in a proper way.
> Any suggestions?
> Is there an alternative (and easy) way to pass two values to the 
> convergenceCriterion function?
> 
> Thank you
> 
> Riccardo!
> 



Re: Passing two value to the ConvergenceCriterion function

2016-03-14 Thread Robert Metzger
Hi,

take a look at the "Record" class. That one implements the Value interface
and can have multiple values.
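
Roughly (untested, and the LongValue/DoubleValue field types are just an example), that could look like this:

import org.apache.flink.types.{DoubleValue, LongValue, Record}

// pack two values into one Record-based aggregate
val aggregate = new Record(2)
aggregate.setField(0, new LongValue(42L))
aggregate.setField(1, new DoubleValue(0.001))

// read them back, e.g. when checking convergence
val count = aggregate.getField(0, classOf[LongValue]).getValue
val delta = aggregate.getField(1, classOf[DoubleValue]).getValue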

On Fri, Mar 11, 2016 at 6:01 PM, Riccardo Diomedi <
riccardo.diomed...@gmail.com> wrote:

> Hi
>
> I want to send two values to the ConvergenceCriterion function, so I
> decided to use an aggregator of Tuple2. But then, when I implement
> Aggregator, I cannot do that because Tuple2 doesn’t implement Value.
> So I tried to create a class Tuple2Value that implements Value, but here I
> get stuck because I don’t know how to do it in a proper way.
> Any suggestions?
> Is there an alternative (and easy) way to pass two values to the
> convergenceCriterion function?
>
> Thank you
>
> Riccardo!