[GitHub] flink pull request: [FLINK-2184] Cannot get last element with maxB...

2016-05-16 Thread gallenvara
Github user gallenvara commented on the pull request:

https://github.com/apache/flink/pull/1975#issuecomment-219362150
  
Hi @fhueske , can you help review this PR? Thanks! :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-1873) Distributed matrix implementation

2016-05-16 Thread Simone Robutti (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284278#comment-15284278
 ] 

Simone Robutti commented on FLINK-1873:
---

The tasks are not independent so I would go for sub-tasks if necessary but I 
will adapt to what you consider better. Just let me know. 

> Distributed matrix implementation
> -
>
> Key: FLINK-1873
> URL: https://issues.apache.org/jira/browse/FLINK-1873
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: liaoyuxi
>Assignee: Simone Robutti
>  Labels: ML
>
> It would help to implement machine learning algorithm more quickly and 
> concise if Flink would provide support for storing data and computation in 
> distributed matrix. The design of the implementation is attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3911) Sort operation before a group reduce doesn't seem to be implemented on 1.0.2

2016-05-16 Thread Fabian Hueske (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284291#comment-15284291
 ] 

Fabian Hueske commented on FLINK-3911:
--

Hi Patrice, that's a very good suggestion!
In fact, we have an open JIRA for this: FLINK-2399.

I'm closing this issue since the problem has been resolved.
Thanks for reporting!

> Sort operation before a group reduce doesn't seem to be implemented on 1.0.2
> 
>
> Key: FLINK-3911
> URL: https://issues.apache.org/jira/browse/FLINK-3911
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.0.2
> Environment: Linux Ubuntu, standalone cluster
>Reporter: Patrice Freydiere
>  Labels: newbie
>
> i have this piece of code: 
>  // group by id and sort on field order
> DataSet> waysGeometry = 
> joinedWaysWithPoints.groupBy(0).sortGroup(1, Order.ASCENDING)
> .reduceGroup(new 
> GroupReduceFunction, Tuple2 byte[]>>() {
> @Override
> public void 
> reduce(Iterable> values,
> 
> Collector> out) throws Exception {
> long id = -1;
> and this exception when executing ;
> ava.lang.Exception: The data preparation for task 'GroupReduce (GroupReduce 
> at constructOSMStreams(ProcessOSM.java:112))' , caused an error: Unrecognized 
> driver strategy for GroupReduce driver: SORTED_GROUP_COMBINE
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:456)
>   at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:345)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.Exception: Unrecognized driver strategy for GroupReduce 
> driver: SORTED_GROUP_COMBINE
>   at 
> org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:90)
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:450)
>   ... 3 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (FLINK-3911) Sort operation before a group reduce doesn't seem to be implemented on 1.0.2

2016-05-16 Thread Fabian Hueske (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Hueske closed FLINK-3911.

Resolution: Not A Bug

> Sort operation before a group reduce doesn't seem to be implemented on 1.0.2
> 
>
> Key: FLINK-3911
> URL: https://issues.apache.org/jira/browse/FLINK-3911
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.0.2
> Environment: Linux Ubuntu, standalone cluster
>Reporter: Patrice Freydiere
>  Labels: newbie
>
> i have this piece of code: 
>  // group by id and sort on field order
> DataSet> waysGeometry = 
> joinedWaysWithPoints.groupBy(0).sortGroup(1, Order.ASCENDING)
> .reduceGroup(new 
> GroupReduceFunction, Tuple2 byte[]>>() {
> @Override
> public void 
> reduce(Iterable> values,
> 
> Collector> out) throws Exception {
> long id = -1;
> and this exception when executing ;
> ava.lang.Exception: The data preparation for task 'GroupReduce (GroupReduce 
> at constructOSMStreams(ProcessOSM.java:112))' , caused an error: Unrecognized 
> driver strategy for GroupReduce driver: SORTED_GROUP_COMBINE
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:456)
>   at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:345)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.Exception: Unrecognized driver strategy for GroupReduce 
> driver: SORTED_GROUP_COMBINE
>   at 
> org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:90)
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:450)
>   ... 3 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-3911) Sort operation before a group reduce doesn't seem to be implemented on 1.0.2

2016-05-16 Thread Patrice Freydiere (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284296#comment-15284296
 ] 

Patrice Freydiere commented on FLINK-3911:
--

hi fabian,
i'm currently working on using flink for geospatial / ML stack,
hope i can make some communications.

thank's for your help,

Patrice



Le lun. 16 mai 2016 à 11:21, Fabian Hueske (JIRA)  a



> Sort operation before a group reduce doesn't seem to be implemented on 1.0.2
> 
>
> Key: FLINK-3911
> URL: https://issues.apache.org/jira/browse/FLINK-3911
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.0.2
> Environment: Linux Ubuntu, standalone cluster
>Reporter: Patrice Freydiere
>  Labels: newbie
>
> i have this piece of code: 
>  // group by id and sort on field order
> DataSet> waysGeometry = 
> joinedWaysWithPoints.groupBy(0).sortGroup(1, Order.ASCENDING)
> .reduceGroup(new 
> GroupReduceFunction, Tuple2 byte[]>>() {
> @Override
> public void 
> reduce(Iterable> values,
> 
> Collector> out) throws Exception {
> long id = -1;
> and this exception when executing ;
> ava.lang.Exception: The data preparation for task 'GroupReduce (GroupReduce 
> at constructOSMStreams(ProcessOSM.java:112))' , caused an error: Unrecognized 
> driver strategy for GroupReduce driver: SORTED_GROUP_COMBINE
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:456)
>   at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:345)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.Exception: Unrecognized driver strategy for GroupReduce 
> driver: SORTED_GROUP_COMBINE
>   at 
> org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:90)
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:450)
>   ... 3 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Gabor Gevay (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284312#comment-15284312
 ] 

Gabor Gevay commented on FLINK-2147:


Unfortunately, no.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Gabor Gevay
>Priority: Minor
>  Labels: statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1827) Move test classes in test folders and fix scope of test dependencies

2016-05-16 Thread Stefano Bortoli (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284315#comment-15284315
 ] 

Stefano Bortoli commented on FLINK-1827:


Hi Josh,

first of all, sorry for the troubles caused! FYI, we are discussing about 
merging the test-utils classes into flink-tests component and remove it. 

I understand your points, and I am happy you could work around the dependencies 
configuration. The main objective of this activity was to make the build of 
flink faster, skipping tests (-Dmaven.test.skip=true). We to did it trying to 
follow maven best practices, but if you are willing to contribute any changes, 
you are welcome to do it. 

> Move test classes in test folders and fix scope of test dependencies
> 
>
> Key: FLINK-1827
> URL: https://issues.apache.org/jira/browse/FLINK-1827
> Project: Flink
>  Issue Type: Improvement
>  Components: Build System
>Affects Versions: 0.9
>Reporter: Flavio Pompermaier
>Priority: Minor
>  Labels: test-compile
> Fix For: 1.1.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Right now it is not possible to avoid compilation of test classes 
> (-Dmaven.test.skip=true) because some project (e.g. flink-test-utils) 
> requires test classes in non-test sources (e.g. 
> scalatest_${scala.binary.version})
> Test classes should be moved to src/main/test (if Java) and src/test/scala 
> (if scala) and use scope=test for test dependencies



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284360#comment-15284360
 ] 

Stavros Kontopoulos commented on FLINK-2147:


Would be ok to give it a shot in the long term, start with a short design 
document etc (although im newbie with Flink)? 

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Gabor Gevay
>Priority: Minor
>  Labels: statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Gabor Gevay (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284406#comment-15284406
 ] 

Gabor Gevay commented on FLINK-2147:


Yes, I think it would be very good if you worked on this, and starting with a 
design document is a good idea. However, if you haven't yet contributed to 
Flink, then I would suggest starting with some simpler tasks first. Note, that 
the hard part here will not be implementing the algorithm that is described in 
the linked paper, but figuring out how it fits into the API and internals of 
Flink.

(This was a sub-task of my Google Summer of Code project last summer, but then 
I haven't really worked on this, because the streaming API was changing a lot 
at that time. I will convert this sub-task to a stand-alone issue now.)

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Gabor Gevay
>Priority: Minor
>  Labels: statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Gabor Gevay (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Gevay updated FLINK-2147:
---
Issue Type: New Feature  (was: Sub-task)
Parent: (was: FLINK-2142)

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>Priority: Minor
>  Labels: statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Gabor Gevay (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Gevay updated FLINK-2147:
---
Priority: Major  (was: Minor)

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-2148] [contrib] Exact and approximate c...

2016-05-16 Thread ggevay
Github user ggevay commented on the pull request:

https://github.com/apache/flink/pull/910#issuecomment-219407507
  
I'm closing this. It doesn't really make much sense to do this on an entire 
stream; users would probably want to do this on large windows instead, so this 
should probably be reimplemented for the new windowing API.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-2148] [contrib] Exact and approximate c...

2016-05-16 Thread ggevay
Github user ggevay closed the pull request at:

https://github.com/apache/flink/pull/910


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Updated] (FLINK-2148) Approximately calculate the number of distinct elements of a stream

2016-05-16 Thread Gabor Gevay (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Gevay updated FLINK-2148:
---
Issue Type: New Feature  (was: Sub-task)
Parent: (was: FLINK-2142)

> Approximately calculate the number of distinct elements of a stream
> ---
>
> Key: FLINK-2148
> URL: https://issues.apache.org/jira/browse/FLINK-2148
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>Assignee: Gabor Gevay
>Priority: Minor
>  Labels: statistics
>
> In the paper
> http://people.seas.harvard.edu/~minilek/papers/f0.pdf
> Kane et al. describes an optimal algorithm for estimating the number of 
> distinct elements in a data stream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284410#comment-15284410
 ] 

Stavros Kontopoulos commented on FLINK-2147:


Yes i agree the api is the hard part, but we could work on this if you want or 
at least check if it is a mature task now considering stream api stability etc.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (FLINK-2146) Fast calculation of min/max with arbitrary eviction and triggers

2016-05-16 Thread Gabor Gevay (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Gevay resolved FLINK-2146.

Resolution: Not A Problem

I'm closing this, since with the redesigned windowing API (since 0.10) this 
became much less important. The previous windowing worked by always keeping 
track of one window only, by adding newly arriving elements and evicting old 
elements, and it fitted all kinds of windowing strategies into this model. This 
issue concerned the updating of min/max when the window is updated this way. 
However, the new windowing API (mostly to handle out of order events) keeps 
track of several windows at the same time, so solving this issue wouldn't 
effect sliding windows for example.

> Fast calculation of min/max with arbitrary eviction and triggers
> 
>
> Key: FLINK-2146
> URL: https://issues.apache.org/jira/browse/FLINK-2146
> Project: Flink
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Gabor Gevay
>Priority: Minor
>
> The last algorithm described here could be used:
> http://codercareer.blogspot.com/2012/02/no-33-maximums-in-sliding-windows.html
> It is based on a double-ended queue which maintains a sorted list of elements 
> of the current window that have the possibility of being the maximal element 
> in the future.
> Store: O(1) amortized
> Evict: O(1)
> emitWindow: O(1)
> memory: O(N)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Gabor Gevay (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Gevay updated FLINK-2147:
---
Labels: approximate statistics  (was: statistics)

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2148) Approximately calculate the number of distinct elements of a stream

2016-05-16 Thread Gabor Gevay (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Gevay updated FLINK-2148:
---
Labels: approximate statistics  (was: statistics)

> Approximately calculate the number of distinct elements of a stream
> ---
>
> Key: FLINK-2148
> URL: https://issues.apache.org/jira/browse/FLINK-2148
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>Assignee: Gabor Gevay
>Priority: Minor
>  Labels: approximate, statistics
>
> In the paper
> http://people.seas.harvard.edu/~minilek/papers/f0.pdf
> Kane et al. describes an optimal algorithm for estimating the number of 
> distinct elements in a data stream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Gabor Gevay (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284427#comment-15284427
 ] 

Gabor Gevay commented on FLINK-2147:


OK, we can certainly think about it.

It would probably make sense to apply this to windows, and not the entire 
stream. In this case, we have to figure out how to apply the algorithm already 
when Flink is building the windows. (I mean we can't just stuff the algorithm 
into a WindowFunction, because then Flink would buffer up all the elements of a 
window, which defeats the purpose.) Maybe it can be done with a fold, since I 
think that Flink is already doing preaggregation for folds.

Another issue is how do we want the user to specify the key? I mean it would 
probably not be elegant to apply this to the entire elements, so maybe the user 
should specify a field on which to apply the algorithm. But then maybe it would 
be good if we could fit this with keyBy, by maybe adding this as a method of 
KeyedStream? I'm not sure.

Note that, for example [~aljoscha] might have a more informed opinion on all 
this.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284431#comment-15284431
 ] 

Stavros Kontopoulos commented on FLINK-2147:


Ok... btw I was looking your old PR for median etc, i am wondering what is the 
status of memory management for window buffering in master.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (FLINK-2145) Median calculation for windows

2016-05-16 Thread Gabor Gevay (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Gevay resolved FLINK-2145.

Resolution: Won't Fix

I'm closing this, as it was based on assumptions of the old (pre-0.10) 
windowing. See https://github.com/apache/flink/pull/684#issuecomment-195402038

> Median calculation for windows
> --
>
> Key: FLINK-2145
> URL: https://issues.apache.org/jira/browse/FLINK-2145
> Project: Flink
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Gabor Gevay
>Assignee: Gabor Gevay
>Priority: Minor
>  Labels: statistics
>
> The PreReducer for this has the following algorithm: We maintain two 
> multisets (as, for example, balanced binary search trees), that always 
> partition the elements of the current window to smaller-than-median and 
> larger-than-median elements. At each store and evict, we can maintain this 
> invariant with only O(1) multiset operations.
> Store: O(log N)
> Evict: O(log N)
> emitWindow: O(1)
> memory: O(N)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-2044] [gelly] Implementation of Gelly H...

2016-05-16 Thread greghogan
Github user greghogan commented on a diff in the pull request:

https://github.com/apache/flink/pull/1956#discussion_r63346119
  
--- Diff: 
flink-libraries/flink-gelly-examples/src/test/java/org/apache/flink/graph/library/HITSAlgorithmITCase.java
 ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.graph.library;
+
+import org.apache.flink.api.java.ExecutionEnvironment;
+import org.apache.flink.api.java.tuple.Tuple2;
+import org.apache.flink.graph.Graph;
+import org.apache.flink.graph.Vertex;
+import org.apache.flink.graph.examples.data.HITSData;
+import org.apache.flink.test.util.MultipleProgramsTestBase;
+import org.apache.flink.types.DoubleValue;
+import org.apache.flink.types.NullValue;
+import org.junit.Assert;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+
+import java.util.Arrays;
+import java.util.List;
+
+@RunWith(Parameterized.class)
+public class HITSAlgorithmITCase extends MultipleProgramsTestBase{
--- End diff --

Should all library and example algorithms be using `CollectionEnvironment`? 
(`env = ExecutionEnvironment.createCollectionsEnvironment();`)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-2044] [gelly] Implementation of Gelly H...

2016-05-16 Thread greghogan
Github user greghogan commented on a diff in the pull request:

https://github.com/apache/flink/pull/1956#discussion_r63346245
  
--- Diff: 
flink-libraries/flink-gelly-examples/src/main/java/org/apache/flink/graph/examples/data/HITSData.java
 ---
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.graph.examples.data;
+
+import org.apache.flink.api.java.DataSet;
+import org.apache.flink.api.java.ExecutionEnvironment;
+import org.apache.flink.graph.Edge;
+import org.apache.flink.graph.Vertex;
+import org.apache.flink.types.NullValue;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Provides the data set used for the HITS test program.
+ */
+public class HITSData {
+
+   public static final String VALUE_AFTER_3_ITERATIONS = "1,0.707,0.007\n" 
+
--- End diff --

Are we better off testing for convergence (or after a larger number of 
iterations) rather than testing an early state after a few iterations which 
seems particularly brittle?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-2142) GSoC project: Exact and Approximate Statistics for Data Streams and Windows

2016-05-16 Thread Gabor Gevay (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284435#comment-15284435
 ] 

Gabor Gevay commented on FLINK-2142:


This proposal was based on the old (pre-0.10) windowing API. I'm now taking it 
apart, by converting sub-tasks to stand-alone issues (FLINK-2148, FLINK-2147) 
and/or modifying/closing those sub-tasks that don't make sense in the current 
streaming API. I will add the label `approximate` to those issues that are 
about approximate calculations.

Note: The main reason why I abandoned this project last summer, is that the 
streaming API was changing a lot at that time, so it seemed better to postpone 
these things.

> GSoC project: Exact and Approximate Statistics for Data Streams and Windows
> ---
>
> Key: FLINK-2142
> URL: https://issues.apache.org/jira/browse/FLINK-2142
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>Assignee: Gabor Gevay
>Priority: Minor
>  Labels: gsoc2015, statistics, streaming
>
> The goal of this project is to implement basic statistics of data streams and 
> windows (like average, median, variance, correlation, etc.) in a 
> computationally efficient manner. This involves designing custom PreReducers.
> The exact calculation of some statistics (eg. frequencies, or the number of 
> distinct elements) would require memory proportional to the number of 
> elements in the input (the window or the entire stream). However, there are 
> efficient algorithms and data structures using less memory for calculating 
> the same statistics only approximately, with user-specified error bounds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Gabor Gevay (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284437#comment-15284437
 ] 

Gabor Gevay commented on FLINK-2147:


I'm actually now closing the incremental median and similar stuff, as it was 
based on the assumption of the old (pre-0.10) windowing API that events come in 
ordered by time, so it doesn't fit into the new API. (See 
https://github.com/apache/flink/pull/684#issuecomment-195402038)

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-2044] [gelly] Implementation of Gelly H...

2016-05-16 Thread greghogan
Github user greghogan commented on a diff in the pull request:

https://github.com/apache/flink/pull/1956#discussion_r63346546
  
--- Diff: 
flink-libraries/flink-gelly-examples/src/main/java/org/apache/flink/graph/examples/data/HITSData.java
 ---
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.graph.examples.data;
+
+import org.apache.flink.api.java.DataSet;
+import org.apache.flink.api.java.ExecutionEnvironment;
+import org.apache.flink.graph.Edge;
+import org.apache.flink.graph.Vertex;
+import org.apache.flink.types.NullValue;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Provides the data set used for the HITS test program.
+ */
+public class HITSData {
+
+   public static final String VALUE_AFTER_3_ITERATIONS = "1,0.707,0.007\n" 
+
+   
"2,0.003,0.707\n" +
+   
"3,0.003,0.500\n" +
+   
"4,0.500,0.500\n" +
+   
"5,0.500,0.007\n";
+
+
+   private HITSData() {}
+
+   public static final DataSet> 
getVertexDataSet(ExecutionEnvironment env) {
+
+   List> vertices = new 
ArrayList>();
--- End diff --

 Can use the diamond operator here, `new ArrayList<>();`. IntelliJ is 
generally accurate in detecting unnecessary code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284440#comment-15284440
 ] 

Stavros Kontopoulos commented on FLINK-2147:


For sure if you apply it per window then you need to avoid keeping any data 
after you update your algorithm/structure. If you have window results i guess 
you can update a statistic about the whole stream when this is valid, depending 
on the statistic.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-3768] [gelly] Clustering Coefficient

2016-05-16 Thread greghogan
Github user greghogan commented on the pull request:

https://github.com/apache/flink/pull/1896#issuecomment-219416157
  
If there are no further review comments I'll look at merging this today.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-2044] [gelly] Implementation of Gelly H...

2016-05-16 Thread greghogan
Github user greghogan commented on a diff in the pull request:

https://github.com/apache/flink/pull/1956#discussion_r63347071
  
--- Diff: 
flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/library/HITSAlgorithm.java
 ---
@@ -0,0 +1,276 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.graph.library;
+
+import org.apache.flink.api.common.aggregators.DoubleSumAggregator;
+import org.apache.flink.api.common.functions.MapFunction;
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.api.java.DataSet;
+
+import org.apache.flink.api.java.tuple.Tuple2;
+import org.apache.flink.api.java.typeutils.TupleTypeInfo;
+import org.apache.flink.graph.Edge;
+import org.apache.flink.graph.EdgeDirection;
+import org.apache.flink.graph.Graph;
+import org.apache.flink.graph.GraphAlgorithm;
+import org.apache.flink.graph.Vertex;
+import org.apache.flink.graph.spargel.MessageIterator;
+import org.apache.flink.graph.spargel.MessagingFunction;
+import org.apache.flink.graph.spargel.ScatterGatherConfiguration;
+import org.apache.flink.graph.spargel.VertexUpdateFunction;
+import org.apache.flink.graph.utils.NullValueEdgeMapper;
+import org.apache.flink.types.DoubleValue;
+import org.apache.flink.types.NullValue;
+import org.apache.flink.util.Preconditions;
+
+/**
+ * This is an implementation of HITS algorithm, using a scatter-gather 
iteration.
+ * The user can define the maximum number of iterations. HITS algorithm is 
determined by two parameters,
+ * hubs and authorities. A good hub represents a page that points to many 
other pages, and a good authority
+ * represented a page that is linked by many different hubs.
+ * Each vertex has a value of Tuple2 type, the first field is hub score 
and the second field is authority score.
+ * The implementation assumes that the two score are the same in each 
vertex at the beginning.
+ * If the number of vertices of the input graph is known, it should be 
provided as a parameter
+ * to speed up computation. Otherwise, the algorithm will first execute a 
job to count the vertices.
+ * 
+ *
+ * @see https://en.wikipedia.org/wiki/HITS_algorithm";>HITS 
Algorithm
+ */
+public class HITSAlgorithm implements GraphAlgorithm>>> {
+
+   private final static int MAXIMUMITERATION = (Integer.MAX_VALUE - 1) / 2;
+   private final static double MINIMUMTHRESHOLD = 1e-9;
+
+   private int maxIterations;
+   private long numberOfVertices;
+   private double convergeThreshold;
+
+   public HITSAlgorithm(int maxIterations) {
+   this(maxIterations, MINIMUMTHRESHOLD);
+   }
+
+   public HITSAlgorithm(double convergeThreshold) {
+   this(MAXIMUMITERATION, convergeThreshold);
+   }
+
+   public HITSAlgorithm(int maxIterations, long numberOfVertices) {
+   this(maxIterations, MINIMUMTHRESHOLD, numberOfVertices);
+   }
+
+   public HITSAlgorithm(double convergeThreshold, long numberOfVertices) {
+   this(MAXIMUMITERATION, convergeThreshold, numberOfVertices);
+   }
+
+   /**
+* Creates an instance of HITS algorithm.
+*
+* @param maxIterations the maximum number of iterations
+* @param convergeThreshold convergence threshold for sum of scores
+*/
+   public HITSAlgorithm(int maxIterations, double convergeThreshold) {
+   Preconditions.checkArgument(maxIterations > 0, "Number of 
iterations must be greater than zero.");
+   Preconditions.checkArgument(convergeThreshold > 0.0, 
"Convergence threshold must be greater than zero.");
+   this.maxIterations = maxIterations * 2 + 1;
+   this.convergeThreshold = convergeThreshold;
+   }
+
+   /**
+* Creates an instance of HITS algorithm.
+*
+* @param maxIterations the maximum number of iterations
+* @param convergeThreshold convergence t

[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284446#comment-15284446
 ] 

Stavros Kontopoulos commented on FLINK-2147:


Ok so if there multiple windows evaluated at different times in parallel since 
data comes out of order, what kind of statistic is computable in this model? 
What are the correct semantics here? Emit a statistic update only when ordering 
is reconstructed (appropriate windows are calculated) and delay future results? 
What about count min?

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Gabor Gevay (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284467#comment-15284467
 ] 

Gabor Gevay commented on FLINK-2147:


In my opinion, the semantics would be to calculate the statistic only about 
each window separately. When to emit is handled by the triggers (as with other 
windowing calculations in Flink.) (Note that the windows can be quite large, 
like weekly or monthly.)

I think that having a statistic about the entire stream is rarely what the user 
actually wants. Flink programs are designed to run indefinitely for a long 
time, and the starting point of a stream is just when the user happened to 
start the Flink program, which might have no real semantic meaning if the Flink 
program is analyzing some external system.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Gabor Gevay (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284468#comment-15284468
 ] 

Gabor Gevay commented on FLINK-2147:


(Note that the out-of-order problem that I mentioned for the incremental median 
stuff was about the situation that I wanted to re-use some part of the 
calculations across different windows (hence the word "incremental"). This is 
probably not necessary here, we can just work on separate windows separately.)

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2144) Implement count, average, and variance for windows

2016-05-16 Thread Gabor Gevay (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284473#comment-15284473
 ] 

Gabor Gevay commented on FLINK-2144:


Originally, I wanted to make this calculation incremental both on new elements 
and on evictions. With the new windowing API, the eviction part lost its 
importance. However, making it incremental for new elements is still valid, as 
this would allow making these calculations without buffering up the window. I 
think this essentially means that these should be implemented in FoldFunctions, 
which should be automatically pre-aggregated by Flink.

> Implement count, average, and variance for windows
> --
>
> Key: FLINK-2144
> URL: https://issues.apache.org/jira/browse/FLINK-2144
> Project: Flink
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Gabor Gevay
>Assignee: Gabor Gevay
>Priority: Minor
>  Labels: statistics
>
> By count I mean the number of elements in the window.
> These can be implemented very efficiently building on FLINK-2143:
> Store: O(1)
> Evict: O(1)
> emitWindow: O(1)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2144) Implement count, average, and variance for windows

2016-05-16 Thread Gabor Gevay (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Gevay updated FLINK-2144:
---
Issue Type: New Feature  (was: Sub-task)
Parent: (was: FLINK-2142)

> Implement count, average, and variance for windows
> --
>
> Key: FLINK-2144
> URL: https://issues.apache.org/jira/browse/FLINK-2144
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>Assignee: Gabor Gevay
>Priority: Minor
>  Labels: statistics
>
> By count I mean the number of elements in the window.
> These can be implemented very efficiently building on FLINK-2143:
> Store: O(1)
> Evict: O(1)
> emitWindow: O(1)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2144) Implement incremental count, average, and variance for windows

2016-05-16 Thread Gabor Gevay (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Gevay updated FLINK-2144:
---
Assignee: (was: Gabor Gevay)

> Implement incremental count, average, and variance for windows
> --
>
> Key: FLINK-2144
> URL: https://issues.apache.org/jira/browse/FLINK-2144
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>Priority: Minor
>  Labels: statistics
>
> By count I mean the number of elements in the window.
> These can be implemented very efficiently building on FLINK-2143:
> Store: O(1)
> Evict: O(1)
> emitWindow: O(1)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2144) Implement incremental count, average, and variance for windows

2016-05-16 Thread Gabor Gevay (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Gevay updated FLINK-2144:
---
Summary: Implement incremental count, average, and variance for windows  
(was: Implement count, average, and variance for windows)

> Implement incremental count, average, and variance for windows
> --
>
> Key: FLINK-2144
> URL: https://issues.apache.org/jira/browse/FLINK-2144
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>Assignee: Gabor Gevay
>Priority: Minor
>  Labels: statistics
>
> By count I mean the number of elements in the window.
> These can be implemented very efficiently building on FLINK-2143:
> Store: O(1)
> Evict: O(1)
> emitWindow: O(1)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2142) GSoC project: Exact and Approximate Statistics for Data Streams and Windows

2016-05-16 Thread Gabor Gevay (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284474#comment-15284474
 ] 

Gabor Gevay commented on FLINK-2142:


(I've also broken off FLINK-2144.)

> GSoC project: Exact and Approximate Statistics for Data Streams and Windows
> ---
>
> Key: FLINK-2142
> URL: https://issues.apache.org/jira/browse/FLINK-2142
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>Assignee: Gabor Gevay
>Priority: Minor
>  Labels: gsoc2015, statistics, streaming
>
> The goal of this project is to implement basic statistics of data streams and 
> windows (like average, median, variance, correlation, etc.) in a 
> computationally efficient manner. This involves designing custom PreReducers.
> The exact calculation of some statistics (eg. frequencies, or the number of 
> distinct elements) would require memory proportional to the number of 
> elements in the input (the window or the entire stream). However, there are 
> efficient algorithms and data structures using less memory for calculating 
> the same statistics only approximately, with user-specified error bounds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2144) Incremental count, average, and variance for windows

2016-05-16 Thread Gabor Gevay (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Gevay updated FLINK-2144:
---
Summary: Incremental count, average, and variance for windows  (was: 
Implement incremental count, average, and variance for windows)

> Incremental count, average, and variance for windows
> 
>
> Key: FLINK-2144
> URL: https://issues.apache.org/jira/browse/FLINK-2144
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>Priority: Minor
>  Labels: statistics
>
> By count I mean the number of elements in the window.
> These can be implemented very efficiently building on FLINK-2143:
> Store: O(1)
> Evict: O(1)
> emitWindow: O(1)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284476#comment-15284476
 ] 

Stavros Kontopoulos commented on FLINK-2147:


Ok i agree then we calculate statistics per window in isolated manner like sum, 
mean etc without the aggregation in buffer. Ok so lets see how we avoid that 
correct?

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Gabor Gevay (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284479#comment-15284479
 ] 

Gabor Gevay commented on FLINK-2147:


Yes. We should probably look in the direction of FoldFunction, since Flink does 
preaggregation for these, if I'm not mistaken.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284484#comment-15284484
 ] 

Stavros Kontopoulos commented on FLINK-2147:


ok i will have a look as well to get familiar with it.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-2044] [gelly] Implementation of Gelly H...

2016-05-16 Thread gallenvara
Github user gallenvara commented on a diff in the pull request:

https://github.com/apache/flink/pull/1956#discussion_r63353881
  
--- Diff: 
flink-libraries/flink-gelly-examples/src/main/java/org/apache/flink/graph/examples/data/HITSData.java
 ---
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.graph.examples.data;
+
+import org.apache.flink.api.java.DataSet;
+import org.apache.flink.api.java.ExecutionEnvironment;
+import org.apache.flink.graph.Edge;
+import org.apache.flink.graph.Vertex;
+import org.apache.flink.types.NullValue;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Provides the data set used for the HITS test program.
+ */
+public class HITSData {
+
+   public static final String VALUE_AFTER_3_ITERATIONS = "1,0.707,0.007\n" 
+
--- End diff --

Yes, i used approximate number of the result.
I would replace the number of iterations with a larger one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-2044] [gelly] Implementation of Gelly H...

2016-05-16 Thread gallenvara
Github user gallenvara commented on a diff in the pull request:

https://github.com/apache/flink/pull/1956#discussion_r63354317
  
--- Diff: 
flink-libraries/flink-gelly-examples/src/test/java/org/apache/flink/graph/library/HITSAlgorithmITCase.java
 ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.graph.library;
+
+import org.apache.flink.api.java.ExecutionEnvironment;
+import org.apache.flink.api.java.tuple.Tuple2;
+import org.apache.flink.graph.Graph;
+import org.apache.flink.graph.Vertex;
+import org.apache.flink.graph.examples.data.HITSData;
+import org.apache.flink.test.util.MultipleProgramsTestBase;
+import org.apache.flink.types.DoubleValue;
+import org.apache.flink.types.NullValue;
+import org.junit.Assert;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+
+import java.util.Arrays;
+import java.util.List;
+
+@RunWith(Parameterized.class)
+public class HITSAlgorithmITCase extends MultipleProgramsTestBase{
--- End diff --

In the `MultipleProgramsTestBase`, the default test mode is `COLLECTION`. 
Should we specify the mode manually?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-2044] [gelly] Implementation of Gelly H...

2016-05-16 Thread gallenvara
Github user gallenvara commented on a diff in the pull request:

https://github.com/apache/flink/pull/1956#discussion_r63354507
  
--- Diff: 
flink-libraries/flink-gelly-examples/src/main/java/org/apache/flink/graph/examples/data/HITSData.java
 ---
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.graph.examples.data;
+
+import org.apache.flink.api.java.DataSet;
+import org.apache.flink.api.java.ExecutionEnvironment;
+import org.apache.flink.graph.Edge;
+import org.apache.flink.graph.Vertex;
+import org.apache.flink.types.NullValue;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Provides the data set used for the HITS test program.
+ */
+public class HITSData {
+
+   public static final String VALUE_AFTER_3_ITERATIONS = "1,0.707,0.007\n" 
+
+   
"2,0.003,0.707\n" +
+   
"3,0.003,0.500\n" +
+   
"4,0.500,0.500\n" +
+   
"5,0.500,0.007\n";
+
+
+   private HITSData() {}
+
+   public static final DataSet> 
getVertexDataSet(ExecutionEnvironment env) {
+
+   List> vertices = new 
ArrayList>();
--- End diff --

OK, i will modify relevant codes in this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-3768] [gelly] Clustering Coefficient

2016-05-16 Thread vasia
Github user vasia commented on the pull request:

https://github.com/apache/flink/pull/1896#issuecomment-219454345
  
Hi @greghogan,
I haven't had time to review this PR. If this is blocking you or is urgent 
for some reason, please go ahead. Otherwise, I could take a look later this 
week.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-3768) Clustering Coefficient

2016-05-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285067#comment-15285067
 ] 

ASF GitHub Bot commented on FLINK-3768:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/1896


> Clustering Coefficient
> --
>
> Key: FLINK-3768
> URL: https://issues.apache.org/jira/browse/FLINK-3768
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> The local clustering coefficient measures the connectedness of each vertex's 
> neighborhood. Values range from 0.0 (no edges between neighbors) to 1.0 
> (neighborhood is a clique).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-3768] [gelly] Clustering Coefficient

2016-05-16 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/1896


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Closed] (FLINK-3768) Clustering Coefficient

2016-05-16 Thread Greg Hogan (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Hogan closed FLINK-3768.
-
Resolution: Implemented

Implemented in c71675f7cbe7e538d62bf1491aff69b369eda9eb

> Clustering Coefficient
> --
>
> Key: FLINK-3768
> URL: https://issues.apache.org/jira/browse/FLINK-3768
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 1.1.0
>Reporter: Greg Hogan
>Assignee: Greg Hogan
> Fix For: 1.1.0
>
>
> The local clustering coefficient measures the connectedness of each vertex's 
> neighborhood. Values range from 0.0 (no edges between neighbors) to 1.0 
> (neighborhood is a clique).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-2829] Confusing error message when Flin...

2016-05-16 Thread rekhajoshm
GitHub user rekhajoshm opened a pull request:

https://github.com/apache/flink/pull/1993

[FLINK-2829] Confusing error message when Flink cannot create enough task 
threads

[FLINK-2829] Confusing error message when Flink cannot create enough task 
threads

Clarifying the flink runtime error message on slot details on make sense 
for user facing them when Flink cannot create enough task threads.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rekhajoshm/flink FLINK-2829

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/1993.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1993


commit b1f4c11f1878d53656c9ca49c6912a95a449f18e
Author: Joshi 
Date:   2016-05-16T21:14:23Z

[FLINK-2829] Clarifying error message when Flink cannot create enough task 
threads




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-2829) Confusing error message when Flink cannot create enough task threads

2016-05-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285315#comment-15285315
 ] 

ASF GitHub Bot commented on FLINK-2829:
---

GitHub user rekhajoshm opened a pull request:

https://github.com/apache/flink/pull/1993

[FLINK-2829] Confusing error message when Flink cannot create enough task 
threads

[FLINK-2829] Confusing error message when Flink cannot create enough task 
threads

Clarifying the flink runtime error message on slot details on make sense 
for user facing them when Flink cannot create enough task threads.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rekhajoshm/flink FLINK-2829

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/1993.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1993


commit b1f4c11f1878d53656c9ca49c6912a95a449f18e
Author: Joshi 
Date:   2016-05-16T21:14:23Z

[FLINK-2829] Clarifying error message when Flink cannot create enough task 
threads




> Confusing error message when Flink cannot create enough task threads
> 
>
> Key: FLINK-2829
> URL: https://issues.apache.org/jira/browse/FLINK-2829
> Project: Flink
>  Issue Type: Improvement
>  Components: JobManager, TaskManager
>Reporter: Gyula Fora
>Priority: Trivial
>
> When Flink runs out of memory while creating too many task threads, the error 
> message received from the job manager is slightly confusing:
> java.lang.Exception: Failed to deploy the task to slot SimpleSlot (1)(63) - 
> eea7250ab5b368693e3c4f14fb94f86d @ localhost - 8 slots - URL: 
> akka://flink/user/taskmanager_1 - ALLOCATED/ALIVE: Response was not of type 
> Acknowledge 
> at 
> org.apache.flink.runtime.executiongraph.Execution$2.onComplete(Execution.java:392)
>  
> at akka.dispatch.OnComplete.internal(Future.scala:247) 
> at akka.dispatch.OnComplete.internal(Future.scala:244) 
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174) 
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171) 
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) 
> at 
> scala.concurrent.impl.ExecutionContextImpl$anon$3.exec(ExecutionContextImpl.scala:107)
>  
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
> at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>  
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Although the error comes from the Taskmanager:
> java.lang.OutOfMemoryError: unable to create new native thread
>   at java.lang.Thread.start0(Native Method)
>   at java.lang.Thread.start(Thread.java:714)
>   at 
> org.apache.flink.runtime.taskmanager.Task.startTaskThread(Task.java:415)
>   at 
> org.apache.flink.runtime.taskmanager.TaskManager.submitTask(TaskManager.scala:904)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-3900] Set nullCheck=true as default in ...

2016-05-16 Thread rekhajoshm
GitHub user rekhajoshm opened a pull request:

https://github.com/apache/flink/pull/1994

[FLINK-3900] Set nullCheck=true as default in TableConfig

[FLINK-3900] Settig nullCheck as true as default in TableConfig

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rekhajoshm/flink FLINK-3900

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/1994.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1994


commit 3a4f10a383a589f3880d06a6ad140498d8a521f6
Author: Joshi 
Date:   2016-05-16T21:25:12Z

[FLINK-3900] Set nullCheck=true as default in TableConfig




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-3900) Set nullCheck=true as default in TableConfig

2016-05-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285339#comment-15285339
 ] 

ASF GitHub Bot commented on FLINK-3900:
---

GitHub user rekhajoshm opened a pull request:

https://github.com/apache/flink/pull/1994

[FLINK-3900] Set nullCheck=true as default in TableConfig

[FLINK-3900] Settig nullCheck as true as default in TableConfig

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rekhajoshm/flink FLINK-3900

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/1994.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1994


commit 3a4f10a383a589f3880d06a6ad140498d8a521f6
Author: Joshi 
Date:   2016-05-16T21:25:12Z

[FLINK-3900] Set nullCheck=true as default in TableConfig




> Set nullCheck=true as default in TableConfig
> 
>
> Key: FLINK-3900
> URL: https://issues.apache.org/jira/browse/FLINK-3900
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API
>Affects Versions: 1.1.0
>Reporter: Flavio Pompermaier
>Priority: Minor
>
> As discussed with Fabian, TableConfig should use nullCheck=true as default to 
> allow for null values in the data



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-3782] ByteArrayOutputStream and ObjectO...

2016-05-16 Thread rekhajoshm
GitHub user rekhajoshm opened a pull request:

https://github.com/apache/flink/pull/1995

[FLINK-3782] ByteArrayOutputStream and ObjectOutputStream should close

[FLINK-3782] ByteArrayOutputStream and ObjectOutputStream should close

The ByteArrayOutputStream close method is useless and has no impact, so is 
usually never called.However I am using try with resources for both to take 
care of closing closeable resources automatically.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rekhajoshm/flink FLINK-3782

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/1995.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1995


commit 2fec41be0f1ef8f1b9f707d0085d6f9fca8101bb
Author: Joshi 
Date:   2016-05-16T22:18:42Z

[FLINK-3782] ByteArrayOutputStream and ObjectOutputStream should close




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-3782) ByteArrayOutputStream and ObjectOutputStream should close

2016-05-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285538#comment-15285538
 ] 

ASF GitHub Bot commented on FLINK-3782:
---

GitHub user rekhajoshm opened a pull request:

https://github.com/apache/flink/pull/1995

[FLINK-3782] ByteArrayOutputStream and ObjectOutputStream should close

[FLINK-3782] ByteArrayOutputStream and ObjectOutputStream should close

The ByteArrayOutputStream close method is useless and has no impact, so is 
usually never called.However I am using try with resources for both to take 
care of closing closeable resources automatically.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rekhajoshm/flink FLINK-3782

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/1995.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1995


commit 2fec41be0f1ef8f1b9f707d0085d6f9fca8101bb
Author: Joshi 
Date:   2016-05-16T22:18:42Z

[FLINK-3782] ByteArrayOutputStream and ObjectOutputStream should close




> ByteArrayOutputStream and ObjectOutputStream should close
> -
>
> Key: FLINK-3782
> URL: https://issues.apache.org/jira/browse/FLINK-3782
> Project: Flink
>  Issue Type: Test
>  Components: Java API
>Affects Versions: 1.0.1
>Reporter: Chenguang He
>Priority: Minor
>  Labels: test
>
> ByteArrayOutputStream and ObjectOutputStream should close
> @Test
>   public void testSerializability() {
>   try {
>   Collection inputCollection = new 
> ArrayList();
>   ElementType element1 = new ElementType(1);
>   ElementType element2 = new ElementType(2);
>   ElementType element3 = new ElementType(3);
>   inputCollection.add(element1);
>   inputCollection.add(element2);
>   inputCollection.add(element3);
>   
>   @SuppressWarnings("unchecked")
>   TypeInformation info = 
> (TypeInformation) 
> TypeExtractor.createTypeInfo(ElementType.class);
>   
>   CollectionInputFormat inputFormat = new 
> CollectionInputFormat(inputCollection,
>   info.createSerializer(new 
> ExecutionConfig()));
>   ByteArrayOutputStream buffer = new 
> ByteArrayOutputStream();//    ObjectOutputStream out = new 
> ObjectOutputStream(buffer);//    out.writeObject(inputFormat);
>   ObjectInputStream in = new ObjectInputStream(new 
> ByteArrayInputStream(buffer.toByteArray()));  //    Object serializationResult = in.readObject();
>   assertNotNull(serializationResult);
>   assertTrue(serializationResult instanceof 
> CollectionInputFormat);
>   @SuppressWarnings("unchecked")
>   CollectionInputFormat result = 
> (CollectionInputFormat) serializationResult;
>   GenericInputSplit inputSplit = new GenericInputSplit(0, 
> 1);
>   inputFormat.open(inputSplit);
>   result.open(inputSplit);
>   while(!inputFormat.reachedEnd() && 
> !result.reachedEnd()){
>   ElementType expectedElement = 
> inputFormat.nextRecord(null);
>   ElementType actualElement = 
> result.nextRecord(null);
>   assertEquals(expectedElement, actualElement);
>   }
>   }
>   catch(Exception e) {
>   e.printStackTrace();
>   fail(e.toString());
>   }
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-2044] [gelly] Implementation of Gelly H...

2016-05-16 Thread gallenvara
Github user gallenvara commented on the pull request:

https://github.com/apache/flink/pull/1956#issuecomment-219605318
  
@greghogan thanks for your advice and relevant codes have been modified. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-2044) Implementation of Gelly HITS Algorithm

2016-05-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285904#comment-15285904
 ] 

ASF GitHub Bot commented on FLINK-2044:
---

Github user gallenvara commented on the pull request:

https://github.com/apache/flink/pull/1956#issuecomment-219605318
  
@greghogan thanks for your advice and relevant codes have been modified. :)


> Implementation of Gelly HITS Algorithm
> --
>
> Key: FLINK-2044
> URL: https://issues.apache.org/jira/browse/FLINK-2044
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Reporter: Ahamd Javid
>Assignee: GaoLun
>Priority: Minor
>
> Implementation of Hits Algorithm in Gelly API using Java. the feature branch 
> can be found here: (https://github.com/JavidMayar/flink/commits/HITS)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-3915) Updating a Dataset in Flink using Flink's Table API

2016-05-16 Thread Akshay Shingote (JIRA)
Akshay Shingote created FLINK-3915:
--

 Summary: Updating a Dataset in Flink using Flink's Table API
 Key: FLINK-3915
 URL: https://issues.apache.org/jira/browse/FLINK-3915
 Project: Flink
  Issue Type: Wish
  Components: Table API
Affects Versions: 1.0.1
Reporter: Akshay Shingote


I am using Flink Table API. I want to update the table in Flink through Pattern 
Detection..I am using 3 fields : routeno,source,distance,category . Now I want 
to update the category based on the value of distance for every routeno...For 
ex : if routeno=1 and distance<=200 then category='daily' ..How can I update 
this using Flink's Table API?? Can we update dataset or Table using Flink's 
Table API ??



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)