[GitHub] flink pull request: [FLINK-2184] Cannot get last element with maxB...
Github user gallenvara commented on the pull request: https://github.com/apache/flink/pull/1975#issuecomment-219362150 Hi @fhueske , can you help review this PR? Thanks! :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-1873) Distributed matrix implementation
[ https://issues.apache.org/jira/browse/FLINK-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284278#comment-15284278 ] Simone Robutti commented on FLINK-1873: --- The tasks are not independent, so I would go for sub-tasks if necessary, but I will adapt to whatever you consider better. Just let me know. > Distributed matrix implementation > - > > Key: FLINK-1873 > URL: https://issues.apache.org/jira/browse/FLINK-1873 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: liaoyuxi >Assignee: Simone Robutti > Labels: ML > > It would help to implement machine learning algorithms more quickly and > concisely if Flink provided support for storing data and computation in a > distributed matrix. The design of the implementation is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3911) Sort operation before a group reduce doesn't seem to be implemented on 1.0.2
[ https://issues.apache.org/jira/browse/FLINK-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284291#comment-15284291 ] Fabian Hueske commented on FLINK-3911: -- Hi Patrice, thanks for reporting this! In fact, we have an open JIRA for this: FLINK-2399. I'm closing this issue in favor of that one. > Sort operation before a group reduce doesn't seem to be implemented on 1.0.2 > > > Key: FLINK-3911 > URL: https://issues.apache.org/jira/browse/FLINK-3911 > Project: Flink > Issue Type: Bug >Affects Versions: 1.0.2 > Environment: Linux Ubuntu, standalone cluster >Reporter: Patrice Freydiere > Labels: newbie > > I have this piece of code: > // group by id and sort on field order > DataSet> waysGeometry = > joinedWaysWithPoints.groupBy(0).sortGroup(1, Order.ASCENDING) > .reduceGroup(new > GroupReduceFunction, Tuple2 byte[]>>() { > @Override > public void > reduce(Iterable> values, > > Collector> out) throws Exception { > long id = -1; > and this exception when executing: > java.lang.Exception: The data preparation for task 'GroupReduce (GroupReduce > at constructOSMStreams(ProcessOSM.java:112))' , caused an error: Unrecognized > driver strategy for GroupReduce driver: SORTED_GROUP_COMBINE > at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:456) > at > org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:345) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.Exception: Unrecognized driver strategy for GroupReduce > driver: SORTED_GROUP_COMBINE > at > org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:90) > at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:450) > ... 3 more
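For readers unfamiliar with the pattern in the report above, the intended semantics of `groupBy(...).sortGroup(...).reduceGroup(...)` — group elements by a key, sort within each group, then fold each sorted group into one output record — can be sketched in plain Java, independent of Flink. The `Point` type, field names, and the string-concatenation reduce below are illustrative, not taken from the original job:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of groupBy(key).sortGroup(order).reduceGroup(...) semantics. */
class GroupSortReduce {

    /** Illustrative element type; not the tuple types from the original job. */
    static final class Point {
        final long wayId;
        final int order;
        final String coord;

        Point(long wayId, int order, String coord) {
            this.wayId = wayId;
            this.order = order;
            this.coord = coord;
        }
    }

    /** Groups by wayId, sorts each group by order, then folds each group into a string. */
    static Map<Long, String> assembleGeometries(List<Point> points) {
        // groupBy(0): bucket the elements by key
        Map<Long, List<Point>> groups = new LinkedHashMap<>();
        for (Point p : points) {
            groups.computeIfAbsent(p.wayId, k -> new ArrayList<>()).add(p);
        }
        Map<Long, String> result = new LinkedHashMap<>();
        for (Map.Entry<Long, List<Point>> e : groups.entrySet()) {
            // sortGroup(1, Order.ASCENDING): sort within the group
            e.getValue().sort(Comparator.comparingInt(p -> p.order));
            // reduceGroup: fold the sorted group into one output record
            StringBuilder sb = new StringBuilder();
            for (Point p : e.getValue()) {
                sb.append(p.coord);
            }
            result.put(e.getKey(), sb.toString());
        }
        return result;
    }
}
```

In Flink the same logic runs distributed and the sort is done by the runtime before the GroupReduceFunction is invoked; this sketch only shows what the chained calls mean.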
[jira] [Closed] (FLINK-3911) Sort operation before a group reduce doesn't seem to be implemented on 1.0.2
[ https://issues.apache.org/jira/browse/FLINK-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fabian Hueske closed FLINK-3911. Resolution: Not A Bug
[jira] [Commented] (FLINK-3911) Sort operation before a group reduce doesn't seem to be implemented on 1.0.2
[ https://issues.apache.org/jira/browse/FLINK-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284296#comment-15284296 ] Patrice Freydiere commented on FLINK-3911: -- Hi Fabian, I'm currently working on using Flink for a geospatial / ML stack; I hope I can share some of this work. Thanks for your help, Patrice
[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams
[ https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284312#comment-15284312 ] Gabor Gevay commented on FLINK-2147: Unfortunately, no. > Approximate calculation of frequencies in data streams > -- > > Key: FLINK-2147 > URL: https://issues.apache.org/jira/browse/FLINK-2147 > Project: Flink > Issue Type: Sub-task > Components: Streaming >Reporter: Gabor Gevay >Priority: Minor > Labels: statistics > > Count-Min sketch is a hashing-based algorithm for approximately keeping track > of the frequencies of elements in a data stream. It is described by Cormode > et al. in the following paper: > http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf > Note that this algorithm can be conveniently implemented in a distributed > way, as described in section 3.2 of the paper. > The paper > http://www.vldb.org/conf/2002/S10P03.pdf > also describes algorithms for approximately keeping track of frequencies, but > here the user can specify a threshold below which she is not interested in > the frequency of an element. The error bounds are also different from the > Count-Min sketch algorithm.
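As a rough illustration of the Count-Min sketch described in the issue above, here is a minimal plain-Java sketch (not Flink code; the depth/width parameters and the pairwise hash scheme are illustrative). Updates increment one counter per row; a query takes the minimum over the rows, so the estimate never underestimates the true frequency:

```java
import java.util.Random;

/** Minimal Count-Min sketch: approximate per-item frequencies in bounded memory. */
class CountMinSketch {
    private static final long PRIME = 2147483629L; // a prime below 2^31

    private final int depth;        // number of hash rows
    private final int width;        // counters per row
    private final long[][] counts;  // depth x width counter matrix
    private final long[] seedsA;
    private final long[] seedsB;

    CountMinSketch(int depth, int width, long seed) {
        this.depth = depth;
        this.width = width;
        this.counts = new long[depth][width];
        this.seedsA = new long[depth];
        this.seedsB = new long[depth];
        Random rnd = new Random(seed);
        for (int i = 0; i < depth; i++) {
            seedsA[i] = 1 + rnd.nextInt((int) PRIME - 1); // a in [1, p-1]
            seedsB[i] = rnd.nextInt((int) PRIME);          // b in [0, p-1]
        }
    }

    /** Row-specific hash: (a*x + b) mod p mod width. */
    private int bucket(int row, long item) {
        long h = (seedsA[row] * item + seedsB[row]) % PRIME;
        if (h < 0) {
            h += PRIME;
        }
        return (int) (h % width);
    }

    /** Record one occurrence of the item: one counter per row is incremented. */
    void add(long item) {
        for (int i = 0; i < depth; i++) {
            counts[i][bucket(i, item)]++;
        }
    }

    /** Estimated frequency; never below the true count (collisions only add). */
    long estimate(long item) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < depth; i++) {
            min = Math.min(min, counts[i][bucket(i, item)]);
        }
        return min;
    }
}
```

Two sketches built with the same seeds can be combined by elementwise addition of their counter matrices, which is what makes the distributed implementation mentioned in the issue (section 3.2 of the paper) convenient.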
[jira] [Commented] (FLINK-1827) Move test classes in test folders and fix scope of test dependencies
[ https://issues.apache.org/jira/browse/FLINK-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284315#comment-15284315 ] Stefano Bortoli commented on FLINK-1827: Hi Josh, first of all, sorry for the trouble caused! FYI, we are discussing merging the test-utils classes into the flink-tests component and removing it. I understand your points, and I am happy you could work around the dependencies configuration. The main objective of this activity was to make the build of Flink faster by skipping tests (-Dmaven.test.skip=true). We did it trying to follow Maven best practices, but if you are willing to contribute any changes, you are welcome to do so. > Move test classes in test folders and fix scope of test dependencies > > > Key: FLINK-1827 > URL: https://issues.apache.org/jira/browse/FLINK-1827 > Project: Flink > Issue Type: Improvement > Components: Build System >Affects Versions: 0.9 >Reporter: Flavio Pompermaier >Priority: Minor > Labels: test-compile > Fix For: 1.1.0 > > Original Estimate: 4h > Remaining Estimate: 4h > > Right now it is not possible to avoid compilation of test classes > (-Dmaven.test.skip=true) because some projects (e.g. flink-test-utils) > require test classes in non-test sources (e.g. > scalatest_${scala.binary.version}). > Test classes should be moved to src/test/java (if Java) and src/test/scala > (if Scala) and use scope=test for test dependencies.
[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams
[ https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284360#comment-15284360 ] Stavros Kontopoulos commented on FLINK-2147: Would it be OK for me to give this a shot in the long term, starting with a short design document, etc.? (Although I'm a newbie with Flink.)
[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams
[ https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284406#comment-15284406 ] Gabor Gevay commented on FLINK-2147: Yes, I think it would be very good if you worked on this, and starting with a design document is a good idea. However, if you haven't yet contributed to Flink, then I would suggest starting with some simpler tasks first. Note that the hard part here will not be implementing the algorithm described in the linked paper, but figuring out how it fits into the API and internals of Flink. (This was a sub-task of my Google Summer of Code project last summer, but I didn't really work on it then, because the streaming API was changing a lot at that time. I will convert this sub-task to a stand-alone issue now.)
[jira] [Updated] (FLINK-2147) Approximate calculation of frequencies in data streams
[ https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Gevay updated FLINK-2147: --- Issue Type: New Feature (was: Sub-task) Parent: (was: FLINK-2142)
[jira] [Updated] (FLINK-2147) Approximate calculation of frequencies in data streams
[ https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Gevay updated FLINK-2147: --- Priority: Major (was: Minor)
[GitHub] flink pull request: [FLINK-2148] [contrib] Exact and approximate c...
Github user ggevay commented on the pull request: https://github.com/apache/flink/pull/910#issuecomment-219407507 I'm closing this. It doesn't really make much sense to do this on an entire stream; users would probably want to do this on large windows instead, so this should probably be reimplemented for the new windowing API.
[GitHub] flink pull request: [FLINK-2148] [contrib] Exact and approximate c...
Github user ggevay closed the pull request at: https://github.com/apache/flink/pull/910
[jira] [Updated] (FLINK-2148) Approximately calculate the number of distinct elements of a stream
[ https://issues.apache.org/jira/browse/FLINK-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Gevay updated FLINK-2148: --- Issue Type: New Feature (was: Sub-task) Parent: (was: FLINK-2142) > Approximately calculate the number of distinct elements of a stream > --- > > Key: FLINK-2148 > URL: https://issues.apache.org/jira/browse/FLINK-2148 > Project: Flink > Issue Type: New Feature > Components: Streaming >Reporter: Gabor Gevay >Assignee: Gabor Gevay >Priority: Minor > Labels: statistics > > In the paper > http://people.seas.harvard.edu/~minilek/papers/f0.pdf > Kane et al. describe an optimal algorithm for estimating the number of > distinct elements in a data stream.
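The Kane et al. algorithm referenced above is fairly involved; a much simpler estimator, linear counting, illustrates the same idea of trading memory for an approximate distinct count. This is not the paper's algorithm and not Flink code — a plain-Java sketch with an illustrative bitmap size and hash mixer:

```java
import java.util.BitSet;

/** Linear counting: estimate distinct count from the fraction of empty hash buckets. */
class LinearCounting {
    private final BitSet bitmap;
    private final int m; // number of buckets; must exceed the expected distinct count

    LinearCounting(int m) {
        this.m = m;
        this.bitmap = new BitSet(m);
    }

    /** Simple integer mixer (Murmur3 finalizer) to spread correlated inputs. */
    private static int mix(int x) {
        x ^= x >>> 16;
        x *= 0x85ebca6b;
        x ^= x >>> 13;
        x *= 0xc2b2ae35;
        x ^= x >>> 16;
        return x;
    }

    /** Mark the bucket for this item; duplicates hit the same bucket and change nothing. */
    void add(int item) {
        bitmap.set(Math.floorMod(mix(item), m));
    }

    /** n-hat = -m * ln(V), where V is the observed fraction of empty buckets. */
    double estimate() {
        int empty = m - bitmap.cardinality();
        if (empty == 0) {
            return m * Math.log(m); // bitmap saturated; the estimator breaks down here
        }
        return -m * Math.log((double) empty / m);
    }
}
```

Like the Count-Min sketch, the bitmap merges across parallel instances (bitwise OR), which is the property a distributed streaming implementation would rely on.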
[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams
[ https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284410#comment-15284410 ] Stavros Kontopoulos commented on FLINK-2147: Yes, I agree the API is the hard part, but we could work on this if you want, or at least check whether it is a mature task now, considering stream API stability etc.
[jira] [Resolved] (FLINK-2146) Fast calculation of min/max with arbitrary eviction and triggers
[ https://issues.apache.org/jira/browse/FLINK-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Gevay resolved FLINK-2146. Resolution: Not A Problem I'm closing this, since with the redesigned windowing API (since 0.10) this became much less important. The previous windowing worked by always keeping track of one window only, by adding newly arriving elements and evicting old elements, and it fitted all kinds of windowing strategies into this model. This issue concerned the updating of min/max when the window is updated this way. However, the new windowing API (mostly to handle out-of-order events) keeps track of several windows at the same time, so solving this issue wouldn't affect sliding windows, for example. > Fast calculation of min/max with arbitrary eviction and triggers > > > Key: FLINK-2146 > URL: https://issues.apache.org/jira/browse/FLINK-2146 > Project: Flink > Issue Type: Sub-task > Components: Streaming >Reporter: Gabor Gevay >Priority: Minor > > The last algorithm described here could be used: > http://codercareer.blogspot.com/2012/02/no-33-maximums-in-sliding-windows.html > It is based on a double-ended queue which maintains a sorted list of elements > of the current window that have the possibility of being the maximal element > in the future. > Store: O(1) amortized > Evict: O(1) > emitWindow: O(1) > memory: O(N)
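The double-ended-queue algorithm summarized in the issue above (amortized O(1) store, O(1) evict, O(1) emit) can be sketched in plain Java; the class and method names are illustrative, not Flink's PreReducer API:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sliding-window maximum via a monotonic deque of {index, value} candidates. */
class SlidingMax {
    private final Deque<long[]> deque = new ArrayDeque<>(); // front = current max
    private long nextIndex = 0;  // index assigned to the next stored element
    private long firstIndex = 0; // index of the oldest element still in the window

    /** Store: amortized O(1). Drop candidates from the back that can never be max. */
    void store(long value) {
        while (!deque.isEmpty() && deque.peekLast()[1] <= value) {
            deque.pollLast();
        }
        deque.addLast(new long[]{nextIndex++, value});
    }

    /** Evict the oldest window element: O(1). It may already have been dropped. */
    void evict() {
        if (!deque.isEmpty() && deque.peekFirst()[0] == firstIndex) {
            deque.pollFirst();
        }
        firstIndex++;
    }

    /** Current window maximum: O(1). Window must be non-empty. */
    long max() {
        return deque.peekFirst()[1];
    }
}
```

The deque stays sorted in decreasing value order, and every element is added and removed at most once, which gives the amortized O(1) bound quoted in the issue.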
[jira] [Updated] (FLINK-2147) Approximate calculation of frequencies in data streams
[ https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Gevay updated FLINK-2147: --- Labels: approximate statistics (was: statistics)
[jira] [Updated] (FLINK-2148) Approximately calculate the number of distinct elements of a stream
[ https://issues.apache.org/jira/browse/FLINK-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Gevay updated FLINK-2148: --- Labels: approximate statistics (was: statistics)
[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams
[ https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284427#comment-15284427 ] Gabor Gevay commented on FLINK-2147: OK, we can certainly think about it. It would probably make sense to apply this to windows, and not the entire stream. In this case, we have to figure out how to apply the algorithm already while Flink is building the windows. (I mean we can't just stuff the algorithm into a WindowFunction, because then Flink would buffer up all the elements of a window, which defeats the purpose.) Maybe it can be done with a fold, since I think Flink is already doing pre-aggregation for folds. Another issue is how we want the user to specify the key: it would probably not be elegant to apply this to entire elements, so maybe the user should specify a field on which to apply the algorithm. But then it might be good to fit this in with keyBy, perhaps by adding it as a method of KeyedStream? I'm not sure. Note that, for example, [~aljoscha] might have a more informed opinion on all this.
[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams
[ https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284431#comment-15284431 ] Stavros Kontopoulos commented on FLINK-2147: OK... BTW, I was looking at your old PR for median etc.; I am wondering what the status of memory management for window buffering in master is.
[jira] [Resolved] (FLINK-2145) Median calculation for windows
[ https://issues.apache.org/jira/browse/FLINK-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Gevay resolved FLINK-2145. Resolution: Won't Fix I'm closing this, as it was based on assumptions of the old (pre-0.10) windowing. See https://github.com/apache/flink/pull/684#issuecomment-195402038 > Median calculation for windows > -- > > Key: FLINK-2145 > URL: https://issues.apache.org/jira/browse/FLINK-2145 > Project: Flink > Issue Type: Sub-task > Components: Streaming >Reporter: Gabor Gevay >Assignee: Gabor Gevay >Priority: Minor > Labels: statistics > > The PreReducer for this has the following algorithm: We maintain two > multisets (as, for example, balanced binary search trees) that always > partition the elements of the current window into smaller-than-median and > larger-than-median elements. At each store and evict, we can maintain this > invariant with only O(1) multiset operations. > Store: O(log N) > Evict: O(log N) > emitWindow: O(1) > memory: O(N)
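The two-multiset PreReducer described in the issue above supports both store and evict in O(log N). The store path alone can be illustrated with the classic two-heap variant of the same partitioning idea in plain Java (eviction omitted; names illustrative, not from the PR):

```java
import java.util.Collections;
import java.util.PriorityQueue;

/** Running median via two heaps partitioning the stream around the median.
 *  Store-only sketch: the JIRA's two-multiset design additionally supports evict. */
class RunningMedian {
    // Max-heap of the smaller half (its top is at or just below the median).
    private final PriorityQueue<Integer> low = new PriorityQueue<>(Collections.reverseOrder());
    // Min-heap of the larger half.
    private final PriorityQueue<Integer> high = new PriorityQueue<>();

    /** Store: O(log N). Keeps low.size() == high.size() or low.size() == high.size() + 1. */
    void store(int x) {
        if (low.isEmpty() || x <= low.peek()) {
            low.add(x);
        } else {
            high.add(x);
        }
        // Rebalance so the invariant on the sizes holds again.
        if (low.size() > high.size() + 1) {
            high.add(low.poll());
        } else if (high.size() > low.size()) {
            low.add(high.poll());
        }
    }

    /** emitWindow: O(1). Median of all stored elements. */
    double median() {
        if (low.size() == high.size()) {
            return (low.peek() + high.peek()) / 2.0;
        }
        return low.peek();
    }
}
```

Replacing the two heaps with two balanced-BST multisets (e.g. counted TreeMaps) is what allows the O(log N) evict that the windowed PreReducer needed.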
[GitHub] flink pull request: [FLINK-2044] [gelly] Implementation of Gelly H...
Github user greghogan commented on a diff in the pull request: https://github.com/apache/flink/pull/1956#discussion_r63346119 --- Diff: flink-libraries/flink-gelly-examples/src/test/java/org/apache/flink/graph/library/HITSAlgorithmITCase.java --- @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.graph.library; + +import org.apache.flink.api.java.ExecutionEnvironment; +import org.apache.flink.api.java.tuple.Tuple2; +import org.apache.flink.graph.Graph; +import org.apache.flink.graph.Vertex; +import org.apache.flink.graph.examples.data.HITSData; +import org.apache.flink.test.util.MultipleProgramsTestBase; +import org.apache.flink.types.DoubleValue; +import org.apache.flink.types.NullValue; +import org.junit.Assert; +import org.junit.Test; +import org.junit.runner.RunWith; +import org.junit.runners.Parameterized; + +import java.util.Arrays; +import java.util.List; + +@RunWith(Parameterized.class) +public class HITSAlgorithmITCase extends MultipleProgramsTestBase{ --- End diff -- Should all library and example algorithms be using `CollectionEnvironment`? 
(`env = ExecutionEnvironment.createCollectionsEnvironment();`)
[GitHub] flink pull request: [FLINK-2044] [gelly] Implementation of Gelly H...
Github user greghogan commented on a diff in the pull request: https://github.com/apache/flink/pull/1956#discussion_r63346245 --- Diff: flink-libraries/flink-gelly-examples/src/main/java/org/apache/flink/graph/examples/data/HITSData.java --- @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.graph.examples.data; + +import org.apache.flink.api.java.DataSet; +import org.apache.flink.api.java.ExecutionEnvironment; +import org.apache.flink.graph.Edge; +import org.apache.flink.graph.Vertex; +import org.apache.flink.types.NullValue; + +import java.util.ArrayList; +import java.util.List; + +/** + * Provides the data set used for the HITS test program. + */ +public class HITSData { + + public static final String VALUE_AFTER_3_ITERATIONS = "1,0.707,0.007\n" + --- End diff -- Are we better off testing for convergence (or after a larger number of iterations) rather than testing an early state after a few iterations which seems particularly brittle?
[jira] [Commented] (FLINK-2142) GSoC project: Exact and Approximate Statistics for Data Streams and Windows
[ https://issues.apache.org/jira/browse/FLINK-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284435 ] Gabor Gevay commented on FLINK-2142: This proposal was based on the old (pre-0.10) windowing API. I'm now taking it apart by converting sub-tasks to stand-alone issues (FLINK-2148, FLINK-2147) and/or modifying/closing those sub-tasks that don't make sense in the current streaming API. I will add the label `approximate` to those issues that are about approximate calculations. Note: the main reason why I abandoned this project last summer is that the streaming API was changing a lot at that time, so it seemed better to postpone these things. > GSoC project: Exact and Approximate Statistics for Data Streams and Windows > --- > > Key: FLINK-2142 > URL: https://issues.apache.org/jira/browse/FLINK-2142 > Project: Flink > Issue Type: New Feature > Components: Streaming >Reporter: Gabor Gevay >Assignee: Gabor Gevay >Priority: Minor > Labels: gsoc2015, statistics, streaming > > The goal of this project is to implement basic statistics of data streams and > windows (like average, median, variance, correlation, etc.) in a > computationally efficient manner. This involves designing custom PreReducers. > The exact calculation of some statistics (eg. frequencies, or the number of > distinct elements) would require memory proportional to the number of > elements in the input (the window or the entire stream). However, there are > efficient algorithms and data structures using less memory for calculating > the same statistics only approximately, with user-specified error bounds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams
[ https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284437 ] Gabor Gevay commented on FLINK-2147: I'm actually now closing the incremental median and similar stuff, as it was based on the assumption of the old (pre-0.10) windowing API that events come in ordered by time, so it doesn't fit into the new API. (See https://github.com/apache/flink/pull/684#issuecomment-195402038) > Approximate calculation of frequencies in data streams > -- > > Key: FLINK-2147 > URL: https://issues.apache.org/jira/browse/FLINK-2147 > Project: Flink > Issue Type: New Feature > Components: Streaming >Reporter: Gabor Gevay > Labels: approximate, statistics > > Count-Min sketch is a hashing-based algorithm for approximately keeping track > of the frequencies of elements in a data stream. It is described by Cormode > et al. in the following paper: > http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf > Note that this algorithm can be conveniently implemented in a distributed > way, as described in section 3.2 of the paper. > The paper > http://www.vldb.org/conf/2002/S10P03.pdf > also describes algorithms for approximately keeping track of frequencies, but > here the user can specify a threshold below which she is not interested in > the frequency of an element. The error-bounds are also different than the > Count-min sketch algorithm.
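The Count-Min sketch described in the ticket above is compact enough to sketch directly. The following is a minimal, illustrative Java implementation (not Flink code; the class name, hashing scheme, and parameters are my own choices): a matrix of `depth` hash rows, where an estimate takes the minimum over the rows and therefore never underestimates the true count.

```java
import java.util.Random;

/** Minimal Count-Min sketch: per-row hashed counters; estimates never underestimate. */
public class CountMinSketch {
    private final long[][] counts; // depth rows x width buckets of counters
    private final int[] seeds;     // one hash seed per row
    private final int width;

    public CountMinSketch(int depth, int width) {
        this.counts = new long[depth][width];
        this.width = width;
        this.seeds = new int[depth];
        Random rnd = new Random(42); // fixed seed for reproducibility
        for (int i = 0; i < depth; i++) {
            seeds[i] = rnd.nextInt();
        }
    }

    // Simple seeded hash mapping an item to a bucket in the given row.
    private int bucket(Object item, int row) {
        int h = item.hashCode() ^ seeds[row];
        h ^= (h >>> 16); // spread the high bits
        return Math.floorMod(h, width);
    }

    /** Record one occurrence of the item: increment one counter in every row. */
    public void add(Object item) {
        for (int row = 0; row < counts.length; row++) {
            counts[row][bucket(item, row)]++;
        }
    }

    /** Estimated frequency: the minimum across rows (>= true count, may overestimate). */
    public long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < counts.length; row++) {
            min = Math.min(min, counts[row][bucket(item, row)]);
        }
        return min;
    }
}
```

Per the paper's section 3.2, the distributed variant follows naturally: sketches with the same seeds can be merged by element-wise addition of their counter matrices.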
[GitHub] flink pull request: [FLINK-2044] [gelly] Implementation of Gelly H...
Github user greghogan commented on a diff in the pull request: https://github.com/apache/flink/pull/1956#discussion_r63346546 --- Diff: flink-libraries/flink-gelly-examples/src/main/java/org/apache/flink/graph/examples/data/HITSData.java --- @@ -0,0 +1,71 @@ +public class HITSData { + + public static final String VALUE_AFTER_3_ITERATIONS = "1,0.707,0.007\n" + + "2,0.003,0.707\n" + + "3,0.003,0.500\n" + + "4,0.500,0.500\n" + + "5,0.500,0.007\n"; + + + private HITSData() {} + + public static final DataSet> getVertexDataSet(ExecutionEnvironment env) { + + List> vertices = new ArrayList>(); --- End diff -- Can use the diamond operator here, `new ArrayList<>();`. IntelliJ is generally accurate in detecting unnecessary code.
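The diamond-operator suggestion above, in standalone form: since Java 7 the compiler infers the generic type arguments on the right-hand side from the declared type, so repeating them is unnecessary. A small self-contained sketch (class and method names are mine, for illustration only):

```java
import java.util.ArrayList;
import java.util.List;

public class DiamondDemo {
    // Pre-Java-7 style: the type argument is repeated on the right-hand side.
    static List<String> verbose() {
        return new ArrayList<String>();
    }

    // Diamond operator (Java 7+): the compiler infers the type argument.
    static List<String> concise() {
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        List<String> a = verbose();
        List<String> b = concise();
        a.add("vertex");
        b.add("vertex");
        // Both are plain ArrayList<String>; the diamond form is just shorter.
        System.out.println(a.equals(b)); // prints "true"
    }
}
```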
[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams
[ https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284440 ] Stavros Kontopoulos commented on FLINK-2147: For sure, if you apply it per window, then you need to avoid keeping any data after you update your algorithm/structure. If you have window results, I guess you can update a statistic about the whole stream when this is valid, depending on the statistic.
[GitHub] flink pull request: [FLINK-3768] [gelly] Clustering Coefficient
Github user greghogan commented on the pull request: https://github.com/apache/flink/pull/1896#issuecomment-219416157 If there are no further review comments I'll look at merging this today.
[GitHub] flink pull request: [FLINK-2044] [gelly] Implementation of Gelly H...
Github user greghogan commented on a diff in the pull request: https://github.com/apache/flink/pull/1956#discussion_r63347071 --- Diff: flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/library/HITSAlgorithm.java --- @@ -0,0 +1,276 @@ +package org.apache.flink.graph.library; + +import org.apache.flink.api.common.aggregators.DoubleSumAggregator; +import org.apache.flink.api.common.functions.MapFunction; +import org.apache.flink.api.common.typeinfo.TypeInformation; +import org.apache.flink.api.java.DataSet; + +import org.apache.flink.api.java.tuple.Tuple2; +import org.apache.flink.api.java.typeutils.TupleTypeInfo; +import org.apache.flink.graph.Edge; +import org.apache.flink.graph.EdgeDirection; +import org.apache.flink.graph.Graph; +import org.apache.flink.graph.GraphAlgorithm; +import org.apache.flink.graph.Vertex; +import org.apache.flink.graph.spargel.MessageIterator; +import org.apache.flink.graph.spargel.MessagingFunction; +import org.apache.flink.graph.spargel.ScatterGatherConfiguration; +import org.apache.flink.graph.spargel.VertexUpdateFunction; +import org.apache.flink.graph.utils.NullValueEdgeMapper; +import org.apache.flink.types.DoubleValue; +import org.apache.flink.types.NullValue; +import org.apache.flink.util.Preconditions; + +/** + * This is an implementation of the HITS algorithm, using a scatter-gather iteration. + * The user can define the maximum number of iterations. The HITS algorithm computes two scores per vertex, + * a hub score and an authority score. A good hub is a page that points to many other pages, and a good authority + * is a page that is linked to by many different hubs. + * Each vertex has a value of Tuple2 type; the first field is the hub score and the second field is the authority score. + * The implementation assumes that the two scores are equal for each vertex at the beginning. + * If the number of vertices of the input graph is known, it should be provided as a parameter + * to speed up computation. Otherwise, the algorithm will first execute a job to count the vertices.
+ * + * @see <a href="https://en.wikipedia.org/wiki/HITS_algorithm">HITS Algorithm</a> + */ +public class HITSAlgorithm<K, VV, EV> implements GraphAlgorithm<K, VV, EV, DataSet<Vertex<K, Tuple2<DoubleValue, DoubleValue>>>> { + + private final static int MAXIMUMITERATION = (Integer.MAX_VALUE - 1) / 2; + private final static double MINIMUMTHRESHOLD = 1e-9; + + private int maxIterations; + private long numberOfVertices; + private double convergeThreshold; + + public HITSAlgorithm(int maxIterations) { + this(maxIterations, MINIMUMTHRESHOLD); + } + + public HITSAlgorithm(double convergeThreshold) { + this(MAXIMUMITERATION, convergeThreshold); + } + + public HITSAlgorithm(int maxIterations, long numberOfVertices) { + this(maxIterations, MINIMUMTHRESHOLD, numberOfVertices); + } + + public HITSAlgorithm(double convergeThreshold, long numberOfVertices) { + this(MAXIMUMITERATION, convergeThreshold, numberOfVertices); + } + + /** +* Creates an instance of HITS algorithm. +* +* @param maxIterations the maximum number of iterations +* @param convergeThreshold convergence threshold for sum of scores +*/ + public HITSAlgorithm(int maxIterations, double convergeThreshold) { + Preconditions.checkArgument(maxIterations > 0, "Number of iterations must be greater than zero."); + Preconditions.checkArgument(convergeThreshold > 0.0, "Convergence threshold must be greater than zero."); + this.maxIterations = maxIterations * 2 + 1; + this.convergeThreshold = convergeThreshold; + } + + /** +* Creates an instance of HITS algorithm. +* +* @param maxIterations the maximum number of iterations +* @param convergeThreshold convergence t
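For intuition, the hub/authority update that the scatter-gather iteration above distributes across the cluster can be written as a plain in-memory power iteration. This is an illustrative sketch only, not the Flink implementation: the dense adjacency matrix, L1 normalization, and fixed iteration count are my own simplifications.

```java
/** Dense, in-memory HITS power iteration: auth[v] sums the hub scores of v's
 *  predecessors, hub[u] sums the authority scores of u's successors, and both
 *  vectors are L1-normalized after each update. */
public class HitsSketch {
    /** Returns { hub, auth } score vectors after the given number of iterations. */
    public static double[][] run(boolean[][] edge, int iterations) {
        int n = edge.length;
        double[] hub = new double[n];
        double[] auth = new double[n];
        java.util.Arrays.fill(hub, 1.0);  // all scores start equal,
        java.util.Arrays.fill(auth, 1.0); // as in the javadoc above
        for (int it = 0; it < iterations; it++) {
            // Authority update: gather hub scores over incoming edges.
            for (int v = 0; v < n; v++) {
                auth[v] = 0.0;
                for (int u = 0; u < n; u++) {
                    if (edge[u][v]) auth[v] += hub[u];
                }
            }
            normalize(auth);
            // Hub update: gather authority scores over outgoing edges.
            for (int u = 0; u < n; u++) {
                hub[u] = 0.0;
                for (int v = 0; v < n; v++) {
                    if (edge[u][v]) hub[u] += auth[v];
                }
            }
            normalize(hub);
        }
        return new double[][] { hub, auth };
    }

    // Scale the vector so its entries sum to 1 (L1 normalization).
    private static void normalize(double[] x) {
        double sum = 0.0;
        for (double v : x) sum += v;
        if (sum > 0.0) {
            for (int i = 0; i < x.length; i++) x[i] /= sum;
        }
    }
}
```

A convergence check such as the `convergeThreshold` in the diff would compare the summed score change between rounds against the threshold instead of running a fixed number of iterations.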
[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams
[ https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284446 ] Stavros Kontopoulos commented on FLINK-2147: OK, so if there are multiple windows evaluated at different times in parallel, since data comes out of order, what kind of statistic is computable in this model? What are the correct semantics here? Emit a statistic update only when ordering is reconstructed (the appropriate windows are calculated) and delay future results? What about Count-Min?
[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams
[ https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284467 ] Gabor Gevay commented on FLINK-2147: In my opinion, the semantics would be to calculate the statistic only about each window separately. When to emit is handled by the triggers (as with other windowing calculations in Flink). (Note that the windows can be quite large, like weekly or monthly.) I think that having a statistic about the entire stream is rarely what the user actually wants. Flink programs are designed to run indefinitely for a long time, and the starting point of a stream is just when the user happened to start the Flink program, which might have no real semantic meaning if the Flink program is analyzing some external system.
[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams
[ https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284468 ] Gabor Gevay commented on FLINK-2147: (Note that the out-of-order problem that I mentioned for the incremental median stuff was about the situation that I wanted to re-use some part of the calculations across different windows (hence the word "incremental"). This is probably not necessary here, we can just work on separate windows separately.)
[jira] [Commented] (FLINK-2144) Implement count, average, and variance for windows
[ https://issues.apache.org/jira/browse/FLINK-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284473 ] Gabor Gevay commented on FLINK-2144: Originally, I wanted to make this calculation incremental both on new elements and on evictions. With the new windowing API, the eviction part lost its importance. However, making it incremental for new elements is still valid, as this would allow making these calculations without buffering up the window. I think this essentially means that these should be implemented in FoldFunctions, which should be automatically pre-aggregated by Flink. > Implement count, average, and variance for windows > -- > > Key: FLINK-2144 > URL: https://issues.apache.org/jira/browse/FLINK-2144 > Project: Flink > Issue Type: Sub-task > Components: Streaming >Reporter: Gabor Gevay >Assignee: Gabor Gevay >Priority: Minor > Labels: statistics > > By count I mean the number of elements in the window. > These can be implemented very efficiently building on FLINK-2143: > Store: O(1) > Evict: O(1) > emitWindow: O(1)
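The O(1)-per-element claim in the ticket above holds for count, average, and variance via Welford's online algorithm: constant state, one update per arriving element, and no window buffer. A minimal standalone sketch (the class name is mine; this is not the proposed Flink code):

```java
/** Incremental count, mean, and variance via Welford's algorithm:
 *  O(1) state and O(1) work per element, no buffering of the window. */
public class RunningStats {
    private long n;
    private double mean;
    private double m2; // running sum of squared deviations from the mean

    /** Fold one new element into the running statistics. */
    public void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;            // incremental mean update
        m2 += delta * (x - mean);     // numerically stable variance update
    }

    public long count() { return n; }
    public double mean() { return mean; }
    /** Population variance of the elements seen so far. */
    public double variance() { return n > 0 ? m2 / n : Double.NaN; }
}
```

This is exactly the fold-shaped computation the comment describes: the accumulator `(n, mean, m2)` is the whole per-window state.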
[jira] [Updated] (FLINK-2144) Implement count, average, and variance for windows
[ https://issues.apache.org/jira/browse/FLINK-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Gevay updated FLINK-2144: --- Issue Type: New Feature (was: Sub-task) Parent: (was: FLINK-2142)
[jira] [Updated] (FLINK-2144) Implement incremental count, average, and variance for windows
[ https://issues.apache.org/jira/browse/FLINK-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Gevay updated FLINK-2144: --- Assignee: (was: Gabor Gevay)
[jira] [Updated] (FLINK-2144) Implement incremental count, average, and variance for windows
[ https://issues.apache.org/jira/browse/FLINK-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Gevay updated FLINK-2144: --- Summary: Implement incremental count, average, and variance for windows (was: Implement count, average, and variance for windows)
[jira] [Commented] (FLINK-2142) GSoC project: Exact and Approximate Statistics for Data Streams and Windows
[ https://issues.apache.org/jira/browse/FLINK-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284474 ] Gabor Gevay commented on FLINK-2142: (I've also broken off FLINK-2144.)
[jira] [Updated] (FLINK-2144) Incremental count, average, and variance for windows
[ https://issues.apache.org/jira/browse/FLINK-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Gevay updated FLINK-2144: --- Summary: Incremental count, average, and variance for windows (was: Implement incremental count, average, and variance for windows)
[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams
[ https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284476 ] Stavros Kontopoulos commented on FLINK-2147: OK, I agree: then we calculate statistics per window in an isolated manner (sum, mean, etc.) without aggregating in a buffer. So let's see how we avoid that buffering, correct?
[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams
[ https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284479 ] Gabor Gevay commented on FLINK-2147: Yes. We should probably look in the direction of FoldFunction, since Flink does preaggregation for these, if I'm not mistaken.
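To illustrate the FoldFunction direction mentioned above in plain Java (a hypothetical stand-in for the actual Flink interface, which is not reproduced here): the per-window state is a single accumulator updated as each element arrives, so the window contents never need to be buffered, and Flink can apply such a function incrementally as a pre-aggregation.

```java
import java.util.List;
import java.util.function.BiFunction;

/** Fold-style aggregation: one accumulator per window, one update per element. */
public class FoldSketch {
    // Stand-in for a FoldFunction-style contract: acc = fold(acc, element),
    // applied incrementally so the window itself is never materialized.
    static <T, ACC> ACC foldWindow(List<T> window, ACC initial, BiFunction<ACC, T, ACC> fold) {
        ACC acc = initial;
        for (T element : window) {
            acc = fold.apply(acc, element); // O(1) state regardless of window size
        }
        return acc;
    }

    public static void main(String[] args) {
        // Example: per-window sum of element lengths with an integer accumulator.
        int sum = foldWindow(List.of("ab", "cde"), 0, (acc, s) -> acc + s.length());
        System.out.println(sum); // prints "5"
    }
}
```

A Count-Min sketch or a running-variance accumulator fits this shape directly: the sketch/accumulator is the fold state, and each element folds into it.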
[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams
[ https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284484 ] Stavros Kontopoulos commented on FLINK-2147: OK, I will have a look as well to get familiar with it.
[GitHub] flink pull request: [FLINK-2044] [gelly] Implementation of Gelly H...
Github user gallenvara commented on a diff in the pull request: https://github.com/apache/flink/pull/1956#discussion_r63353881 --- Diff: flink-libraries/flink-gelly-examples/src/main/java/org/apache/flink/graph/examples/data/HITSData.java --- @@ -0,0 +1,71 @@ +public class HITSData { + + public static final String VALUE_AFTER_3_ITERATIONS = "1,0.707,0.007\n" + --- End diff -- Yes, I used approximate numbers for the result. I will replace the number of iterations with a larger one.
[GitHub] flink pull request: [FLINK-2044] [gelly] Implementation of Gelly H...
Github user gallenvara commented on a diff in the pull request: https://github.com/apache/flink/pull/1956#discussion_r63354317 --- Diff: flink-libraries/flink-gelly-examples/src/test/java/org/apache/flink/graph/library/HITSAlgorithmITCase.java --- @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.graph.library; + +import org.apache.flink.api.java.ExecutionEnvironment; +import org.apache.flink.api.java.tuple.Tuple2; +import org.apache.flink.graph.Graph; +import org.apache.flink.graph.Vertex; +import org.apache.flink.graph.examples.data.HITSData; +import org.apache.flink.test.util.MultipleProgramsTestBase; +import org.apache.flink.types.DoubleValue; +import org.apache.flink.types.NullValue; +import org.junit.Assert; +import org.junit.Test; +import org.junit.runner.RunWith; +import org.junit.runners.Parameterized; + +import java.util.Arrays; +import java.util.List; + +@RunWith(Parameterized.class) +public class HITSAlgorithmITCase extends MultipleProgramsTestBase{ --- End diff -- In the `MultipleProgramsTestBase`, the default test mode is `COLLECTION`. Should we specify the mode manually? 
[GitHub] flink pull request: [FLINK-2044] [gelly] Implementation of Gelly H...
Github user gallenvara commented on a diff in the pull request: https://github.com/apache/flink/pull/1956#discussion_r63354507

--- Diff: flink-libraries/flink-gelly-examples/src/main/java/org/apache/flink/graph/examples/data/HITSData.java ---
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.graph.examples.data;
+
+import org.apache.flink.api.java.DataSet;
+import org.apache.flink.api.java.ExecutionEnvironment;
+import org.apache.flink.graph.Edge;
+import org.apache.flink.graph.Vertex;
+import org.apache.flink.types.NullValue;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Provides the data set used for the HITS test program.
+ */
+public class HITSData {
+
+    public static final String VALUE_AFTER_3_ITERATIONS = "1,0.707,0.007\n" +
+        "2,0.003,0.707\n" +
+        "3,0.003,0.500\n" +
+        "4,0.500,0.500\n" +
+        "5,0.500,0.007\n";
+
+
+    private HITSData() {}
+
+    public static final DataSet> getVertexDataSet(ExecutionEnvironment env) {
+
+        List> vertices = new ArrayList>();
--- End diff --

OK, I will modify the relevant code in this PR.
[GitHub] flink pull request: [FLINK-3768] [gelly] Clustering Coefficient
Github user vasia commented on the pull request: https://github.com/apache/flink/pull/1896#issuecomment-219454345

Hi @greghogan, I haven't had time to review this PR. If this is blocking you or is urgent for some reason, please go ahead. Otherwise, I could take a look later this week.
[jira] [Commented] (FLINK-3768) Clustering Coefficient
[ https://issues.apache.org/jira/browse/FLINK-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285067#comment-15285067 ]

ASF GitHub Bot commented on FLINK-3768:

Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/1896

> Clustering Coefficient
> ----------------------
>
> Key: FLINK-3768
> URL: https://issues.apache.org/jira/browse/FLINK-3768
> Project: Flink
> Issue Type: New Feature
> Components: Gelly
> Affects Versions: 1.1.0
> Reporter: Greg Hogan
> Assignee: Greg Hogan
> Fix For: 1.1.0
>
> The local clustering coefficient measures the connectedness of each vertex's
> neighborhood. Values range from 0.0 (no edges between neighbors) to 1.0
> (neighborhood is a clique).
[GitHub] flink pull request: [FLINK-3768] [gelly] Clustering Coefficient
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/1896
[jira] [Closed] (FLINK-3768) Clustering Coefficient
[ https://issues.apache.org/jira/browse/FLINK-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Hogan closed FLINK-3768.
-----------------------------
Resolution: Implemented

Implemented in c71675f7cbe7e538d62bf1491aff69b369eda9eb

> Clustering Coefficient
> ----------------------
>
> Key: FLINK-3768
> URL: https://issues.apache.org/jira/browse/FLINK-3768
> Project: Flink
> Issue Type: New Feature
> Components: Gelly
> Affects Versions: 1.1.0
> Reporter: Greg Hogan
> Assignee: Greg Hogan
> Fix For: 1.1.0
>
> The local clustering coefficient measures the connectedness of each vertex's
> neighborhood. Values range from 0.0 (no edges between neighbors) to 1.0
> (neighborhood is a clique).
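The definition quoted above (0.0 when no edges connect a vertex's neighbors, 1.0 when the neighborhood is a clique) can be sketched directly in plain Java, independent of the Gelly implementation that was merged; the class and method names here are illustrative:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

public class LocalClusteringCoefficient {

    // adj maps each vertex of an undirected graph to its neighbor set.
    // Returns the fraction of neighbor pairs of v that are themselves connected.
    static double coefficient(Map<Integer, Set<Integer>> adj, int v) {
        Set<Integer> nbrs = adj.getOrDefault(v, Collections.emptySet());
        int d = nbrs.size();
        if (d < 2) {
            return 0.0; // fewer than two neighbors: no pairs to connect
        }
        int links = 0;
        for (int u : nbrs) {
            for (int w : nbrs) {
                if (u < w && adj.getOrDefault(u, Collections.emptySet()).contains(w)) {
                    links++; // edge between two neighbors of v
                }
            }
        }
        // d*(d-1)/2 possible neighbor pairs
        return 2.0 * links / (d * (d - 1.0));
    }
}
```

For a triangle every vertex scores 1.0 (its two neighbors are connected); for the center of a star the score is 0.0, matching the range described in the issue.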
[GitHub] flink pull request: [FLINK-2829] Confusing error message when Flin...
GitHub user rekhajoshm opened a pull request: https://github.com/apache/flink/pull/1993

[FLINK-2829] Confusing error message when Flink cannot create enough task threads

Clarifies the Flink runtime error message about slot details so that it makes sense to users when Flink cannot create enough task threads.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rekhajoshm/flink FLINK-2829
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/1993.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #1993

commit b1f4c11f1878d53656c9ca49c6912a95a449f18e
Author: Joshi
Date: 2016-05-16T21:14:23Z
[FLINK-2829] Clarifying error message when Flink cannot create enough task threads
[jira] [Commented] (FLINK-2829) Confusing error message when Flink cannot create enough task threads
[ https://issues.apache.org/jira/browse/FLINK-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285315#comment-15285315 ]

ASF GitHub Bot commented on FLINK-2829:

GitHub user rekhajoshm opened a pull request: https://github.com/apache/flink/pull/1993

> Confusing error message when Flink cannot create enough task threads
> --------------------------------------------------------------------
>
> Key: FLINK-2829
> URL: https://issues.apache.org/jira/browse/FLINK-2829
> Project: Flink
> Issue Type: Improvement
> Components: JobManager, TaskManager
> Reporter: Gyula Fora
> Priority: Trivial
>
> When Flink runs out of memory while creating too many task threads, the error
> message received from the job manager is slightly confusing:
> java.lang.Exception: Failed to deploy the task to slot SimpleSlot (1)(63) -
> eea7250ab5b368693e3c4f14fb94f86d @ localhost - 8 slots - URL:
> akka://flink/user/taskmanager_1 - ALLOCATED/ALIVE: Response was not of type Acknowledge
>     at org.apache.flink.runtime.executiongraph.Execution$2.onComplete(Execution.java:392)
>     at akka.dispatch.OnComplete.internal(Future.scala:247)
>     at akka.dispatch.OnComplete.internal(Future.scala:244)
>     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174)
>     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171)
>     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>     at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Although the error comes from the TaskManager:
> java.lang.OutOfMemoryError: unable to create new native thread
>     at java.lang.Thread.start0(Native Method)
>     at java.lang.Thread.start(Thread.java:714)
>     at org.apache.flink.runtime.taskmanager.Task.startTaskThread(Task.java:415)
>     at org.apache.flink.runtime.taskmanager.TaskManager.submitTask(TaskManager.scala:904)
[GitHub] flink pull request: [FLINK-3900] Set nullCheck=true as default in ...
GitHub user rekhajoshm opened a pull request: https://github.com/apache/flink/pull/1994

[FLINK-3900] Set nullCheck=true as default in TableConfig

[FLINK-3900] Setting nullCheck=true as the default in TableConfig.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rekhajoshm/flink FLINK-3900
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/1994.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #1994

commit 3a4f10a383a589f3880d06a6ad140498d8a521f6
Author: Joshi
Date: 2016-05-16T21:25:12Z
[FLINK-3900] Set nullCheck=true as default in TableConfig
[jira] [Commented] (FLINK-3900) Set nullCheck=true as default in TableConfig
[ https://issues.apache.org/jira/browse/FLINK-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285339#comment-15285339 ]

ASF GitHub Bot commented on FLINK-3900:

GitHub user rekhajoshm opened a pull request: https://github.com/apache/flink/pull/1994

> Set nullCheck=true as default in TableConfig
> --------------------------------------------
>
> Key: FLINK-3900
> URL: https://issues.apache.org/jira/browse/FLINK-3900
> Project: Flink
> Issue Type: Improvement
> Components: Table API
> Affects Versions: 1.1.0
> Reporter: Flavio Pompermaier
> Priority: Minor
>
> As discussed with Fabian, TableConfig should use nullCheck=true as default to
> allow for null values in the data.
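To illustrate why the default matters: with null checking disabled, code that evaluates an expression over nullable fields dereferences them directly and fails on null input; with null checking enabled, each operand is guarded and null propagates, matching SQL three-valued semantics. This is a minimal plain-Java sketch of the guarded evaluation, not the actual code the Table API generates:

```java
public class NullCheckSketch {

    // With nullCheck=false, generated code would effectively compute a + b
    // directly and throw a NullPointerException when either field is null.
    // With nullCheck=true, each operand is guarded and null propagates.
    static Integer plusWithNullCheck(Integer a, Integer b) {
        if (a == null || b == null) {
            return null; // SQL semantics: any operand null => result null
        }
        return a + b;
    }

    public static void main(String[] args) {
        System.out.println(plusWithNullCheck(1, 2));    // 3
        System.out.println(plusWithNullCheck(null, 2)); // null
    }
}
```

The cost of the default is one branch per nullable operand, which is why it was previously off; the JIRA argues correctness for nullable data should win.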
[GitHub] flink pull request: [FLINK-3782] ByteArrayOutputStream and ObjectO...
GitHub user rekhajoshm opened a pull request: https://github.com/apache/flink/pull/1995

[FLINK-3782] ByteArrayOutputStream and ObjectOutputStream should close

The close method of ByteArrayOutputStream is a no-op, so it is usually never called. However, I am using try-with-resources for both streams so that closeable resources are closed automatically.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rekhajoshm/flink FLINK-3782
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/1995.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #1995

commit 2fec41be0f1ef8f1b9f707d0085d6f9fca8101bb
Author: Joshi
Date: 2016-05-16T22:18:42Z
[FLINK-3782] ByteArrayOutputStream and ObjectOutputStream should close
[jira] [Commented] (FLINK-3782) ByteArrayOutputStream and ObjectOutputStream should close
[ https://issues.apache.org/jira/browse/FLINK-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285538#comment-15285538 ]

ASF GitHub Bot commented on FLINK-3782:

GitHub user rekhajoshm opened a pull request: https://github.com/apache/flink/pull/1995

> ByteArrayOutputStream and ObjectOutputStream should close
> ---------------------------------------------------------
>
> Key: FLINK-3782
> URL: https://issues.apache.org/jira/browse/FLINK-3782
> Project: Flink
> Issue Type: Test
> Components: Java API
> Affects Versions: 1.0.1
> Reporter: Chenguang He
> Priority: Minor
> Labels: test
>
> ByteArrayOutputStream and ObjectOutputStream should close:
>
> @Test
> public void testSerializability() {
>     try {
>         Collection<ElementType> inputCollection = new ArrayList<ElementType>();
>         ElementType element1 = new ElementType(1);
>         ElementType element2 = new ElementType(2);
>         ElementType element3 = new ElementType(3);
>         inputCollection.add(element1);
>         inputCollection.add(element2);
>         inputCollection.add(element3);
>
>         @SuppressWarnings("unchecked")
>         TypeInformation<ElementType> info =
>             (TypeInformation<ElementType>) TypeExtractor.createTypeInfo(ElementType.class);
>
>         CollectionInputFormat<ElementType> inputFormat =
>             new CollectionInputFormat<ElementType>(inputCollection,
>                 info.createSerializer(new ExecutionConfig()));
>
>         ByteArrayOutputStream buffer = new ByteArrayOutputStream();
>         ObjectOutputStream out = new ObjectOutputStream(buffer);
>         out.writeObject(inputFormat);
>         ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()));
>         Object serializationResult = in.readObject();
>
>         assertNotNull(serializationResult);
>         assertTrue(serializationResult instanceof CollectionInputFormat);
>
>         @SuppressWarnings("unchecked")
>         CollectionInputFormat<ElementType> result = (CollectionInputFormat<ElementType>) serializationResult;
>
>         GenericInputSplit inputSplit = new GenericInputSplit(0, 1);
>         inputFormat.open(inputSplit);
>         result.open(inputSplit);
>
>         while (!inputFormat.reachedEnd() && !result.reachedEnd()) {
>             ElementType expectedElement = inputFormat.nextRecord(null);
>             ElementType actualElement = result.nextRecord(null);
>             assertEquals(expectedElement, actualElement);
>         }
>     } catch (Exception e) {
>         e.printStackTrace();
>         fail(e.toString());
>     }
> }
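The fix the PR describes — wrapping the streams in try-with-resources so they are closed even when an exception is thrown — can be sketched in self-contained Java. The helper names here are illustrative, not taken from the Flink patch:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SerializationRoundTrip {

    // Serializes obj; the ObjectOutputStream is closed automatically when the
    // try block exits, flushing its buffer into the ByteArrayOutputStream.
    static byte[] serialize(Serializable obj) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Deserializes the bytes; the ObjectInputStream is likewise closed
    // automatically, even if readObject throws.
    static Object deserialize(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = serialize("hello");
        System.out.println(deserialize(bytes));
    }
}
```

As the PR notes, ByteArrayOutputStream.close() is itself a no-op, but try-with-resources costs nothing here and guarantees the ObjectOutputStream is flushed and closed on every path.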
[GitHub] flink pull request: [FLINK-2044] [gelly] Implementation of Gelly H...
Github user gallenvara commented on the pull request: https://github.com/apache/flink/pull/1956#issuecomment-219605318

@greghogan thanks for your advice; the relevant code has been modified. :)
[jira] [Commented] (FLINK-2044) Implementation of Gelly HITS Algorithm
[ https://issues.apache.org/jira/browse/FLINK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285904#comment-15285904 ]

ASF GitHub Bot commented on FLINK-2044:

Github user gallenvara commented on the pull request: https://github.com/apache/flink/pull/1956#issuecomment-219605318
@greghogan thanks for your advice and relevant codes have been modified. :)

> Implementation of Gelly HITS Algorithm
> --------------------------------------
>
> Key: FLINK-2044
> URL: https://issues.apache.org/jira/browse/FLINK-2044
> Project: Flink
> Issue Type: New Feature
> Components: Gelly
> Reporter: Ahamd Javid
> Assignee: GaoLun
> Priority: Minor
>
> Implementation of the HITS algorithm in the Gelly API using Java. The feature branch
> can be found here: https://github.com/JavidMayar/flink/commits/HITS
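For readers unfamiliar with HITS: the algorithm alternately updates authority scores (sum of hub scores over incoming edges) and hub scores (sum of authority scores over outgoing edges), normalizing after each round. This is a minimal power-iteration sketch in plain Java, not the Gelly vertex-centric implementation under review:

```java
import java.util.Arrays;

public class Hits {

    // edges[i] = {src, dst} for a directed graph with n vertices.
    // Returns {hub, authority} score arrays after the given iterations.
    static double[][] run(int n, int[][] edges, int iterations) {
        double[] hub = new double[n];
        double[] auth = new double[n];
        Arrays.fill(hub, 1.0);
        Arrays.fill(auth, 1.0);
        for (int it = 0; it < iterations; it++) {
            // Authority update: sum hub scores of in-neighbors.
            double[] newAuth = new double[n];
            for (int[] e : edges) {
                newAuth[e[1]] += hub[e[0]];
            }
            // Hub update: sum the fresh authority scores of out-neighbors.
            double[] newHub = new double[n];
            for (int[] e : edges) {
                newHub[e[0]] += newAuth[e[1]];
            }
            normalize(newAuth);
            normalize(newHub);
            auth = newAuth;
            hub = newHub;
        }
        return new double[][] { hub, auth };
    }

    // Scales the vector to unit Euclidean length.
    static void normalize(double[] v) {
        double s = 0;
        for (double x : v) {
            s += x * x;
        }
        s = Math.sqrt(s);
        if (s > 0) {
            for (int i = 0; i < v.length; i++) {
                v[i] /= s;
            }
        }
    }
}
```

On a small graph such as 0→1, 2→1, 0→3, vertex 1 ends up with the top authority score and vertex 0 with the top hub score, which is the shape of the expected values in the PR's HITSData test fixture.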
[jira] [Created] (FLINK-3915) Updating a Dataset in Flink using Flink's Table API
Akshay Shingote created FLINK-3915:
-----------------------------------
Summary: Updating a Dataset in Flink using Flink's Table API
Key: FLINK-3915
URL: https://issues.apache.org/jira/browse/FLINK-3915
Project: Flink
Issue Type: Wish
Components: Table API
Affects Versions: 1.0.1
Reporter: Akshay Shingote

I am using the Flink Table API. I want to update a table in Flink through pattern detection. I am using the fields routeno, source, distance, category. I want to update the category based on the value of distance for every routeno. For example: if routeno=1 and distance<=200, then category='daily'. How can I update this using Flink's Table API? Can we update a DataSet or Table using Flink's Table API?
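At the time of this thread the Table API had no update-in-place: tables and DataSets are immutable, so the usual answer is to derive a new dataset with the recomputed column. A plain-Java sketch of that derivation follows; the Route class and field names are taken from the question and are hypothetical, and the stream map stands in for a DataSet map()/Table select():

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CategoryUpdate {

    // Hypothetical record matching the fields described in the JIRA.
    static final class Route {
        final int routeno;
        final String source;
        final double distance;
        final String category;

        Route(int routeno, String source, double distance, String category) {
            this.routeno = routeno;
            this.source = source;
            this.distance = distance;
            this.category = category;
        }
    }

    // Derives a new list in which category is recomputed from distance,
    // mirroring how an immutable DataSet is "updated" by mapping to a new one.
    static List<Route> recategorize(List<Route> routes) {
        return routes.stream()
            .map(r -> new Route(r.routeno, r.source, r.distance,
                    (r.routeno == 1 && r.distance <= 200) ? "daily" : r.category))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Route> routes = Arrays.asList(
            new Route(1, "A", 150.0, "unknown"),
            new Route(2, "B", 500.0, "weekly"));
        for (Route r : recategorize(routes)) {
            System.out.println(r.routeno + " -> " + r.category);
        }
    }
}
```

In Flink itself the same shape would be a map() over DataSet<Route> (or a select() with an expression on the Table), producing a new dataset rather than mutating the old one.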