Re: Could not build up connection to JobManager
Here is the taskmanager log from when I tried taskmanager.sh start: flink-Vidura-taskmanager-localhost.log https://gist.github.com/anonymous/aef5a0bf8722feee9b97#file-flink-vidura-taskmanager-localhost-log On Feb 27, 2015, at 4:12 PM, Till Rohrmann trohrm...@apache.org wrote: It depends on how you started Flink. If you started a local cluster, then the TaskManager log is contained in the JobManager log; we just don't see the respective log output in the snippet you posted. If you started a TaskManager independently, either by taskmanager.sh or by start-cluster.sh, then a file with the name format flink-user-taskmanager-hostname.log should be created in flink/log/. If the Flink directory is not shared by your cluster nodes, then you have to look on the machine on which you started the TaskManager. But since the JobManager binds to 127.0.0.1, I guess that you started a local cluster. Check whether you find some logging statements from the logger org.apache.flink.runtime.taskmanager.TaskManager in your log. Maybe you can upload the corresponding log file to [1] and post a link here. Greets, Till [1] https://gist.github.com/ On Thu, Feb 26, 2015 at 6:45 PM, Dulaj Viduranga vidura...@icloud.com wrote: Hi, Can you tell me where I can find the TaskManager logs? I can't find them in the log folder. I don't suppose I should run taskmanager.sh as well, right? I'm using OS X Yosemite. I'll send you my ifconfig.
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
	options=3<RXCSUM,TXCSUM>
	inet6 ::1 prefixlen 128
	inet 127.0.0.1 netmask 0xff00
	inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
	nd6 options=1<PERFORMNUD>
gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
stf0: flags=0<> mtu 1280
en0: flags=8823<UP,BROADCAST,SMART,SIMPLEX,MULTICAST> mtu 1500
	ether 60:03:08:a1:e0:f4
	nd6 options=1<PERFORMNUD>
	media: autoselect (unknown type)
	status: inactive
en1: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
	options=60<TSO4,TSO6>
	ether 72:00:02:32:14:d0
	media: autoselect full-duplex
	status: inactive
en2: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
	options=60<TSO4,TSO6>
	ether 72:00:02:32:14:d1
	media: autoselect full-duplex
	status: inactive
bridge0: flags=8822<BROADCAST,SMART,SIMPLEX,MULTICAST> mtu 1500
	options=63<RXCSUM,TXCSUM,TSO4,TSO6>
	ether 62:03:08:1a:fa:00
	Configuration:
		id 0:0:0:0:0:0 priority 0 hellotime 0 fwddelay 0
		maxage 0 holdcnt 0 proto stp maxaddr 100 timeout 1200
		root id 0:0:0:0:0:0 priority 0 ifcost 0 port 0
		ipfilter disabled flags 0x2
	member: en1 flags=3<LEARNING,DISCOVER>
		ifmaxaddr 0 port 5 priority 0 path cost 0
	member: en2 flags=3<LEARNING,DISCOVER>
		ifmaxaddr 0 port 6 priority 0 path cost 0
	media: <unknown type>
	status: inactive
p2p0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> mtu 2304
	ether 02:03:08:a1:e0:f4
	media: autoselect
	status: inactive
awdl0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> mtu 1452
	ether 06:56:3d:f6:60:08
	nd6 options=1<PERFORMNUD>
	media: autoselect
	status: inactive
ppp0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1500
	inet 10.218.98.228 --> 10.64.64.64 netmask 0xff00
utun0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1380
	inet6 fe80::b0d4:d4be:7e62:e730%utun0 prefixlen 64 scopeid 0xb
	inet6 fdd0:b291:7da7:9153:b0d4:d4be:7e62:e730 prefixlen 64
	nd6 options=1<PERFORMNUD>
On Feb 26, 2015, at 10:48 PM, Stephan Ewen se...@apache.org wrote: Hi Dulaj! Thanks for helping to debug.
My guess is that you are now seeing the same thing between JobManager and TaskManager as you saw before between JobManager and JobClient. I have a patch pending that should help with the issue (see https://issues.apache.org/jira/browse/FLINK-1608), let's see if that solves it. What seems not right is that the JobManager initially accepted the TaskManager but later the communication failed. Can you paste the TaskManager log as well? Also: There must be something fairly unique about your network configuration, as it works on all other setups that we use (locally, cloud, test servers, YARN, ...). Can you paste your ipconfig / ifconfig by any chance? Greetings, Stephan On Thu, Feb 26, 2015 at 4:33 PM, Dulaj Viduranga vidura...@icloud.com wrote: Hi, It’s great to help out. :) Setting 127.0.0.1 instead of “localhost” in jobmanager.rpc.address helped to build the connection to the jobmanager. Apparently localhost resolution is different in the webclient and the jobmanager. I think it would be good to set jobmanager.rpc.address: 127.0.0.1 in future builds. But then I get this error when I tried to run the examples. I don’t know if I
Thoughts About Object Reuse and Collection Execution
Hello Nation of Flink, while figuring out this bug: https://issues.apache.org/jira/browse/FLINK-1569 I came upon some difficulties. The problem is that the KeyExtractorMappers always return the same tuple. This is problematic, since collection execution simply stores the returned values in a list. These elements are not copied before they are stored when object reuse is enabled. Therefore, the whole list will contain only that one reused element. I see two options for solving this: 1. Change the KeyExtractorMappers to always return a new tuple, thereby making object-reuse mode in cluster execution useless for key extractors. 2. Change the collection execution mapper to always make copies of the returned elements. This would make object reuse in collection execution pretty much obsolete, IMHO. How should we proceed with this? Cheers, Aljoscha
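The pitfall can be reproduced in a few lines of plain Java. The classes below are illustrative stand-ins, not Flink's actual KeyExtractorMapper or collection backend: a mapper that reuses one mutable output object, and a list that stores the returned references without copying (option 2 from above corresponds to cloning before storing).

```java
import java.util.ArrayList;
import java.util.List;

public class ObjectReuseDemo {
    // Hypothetical stand-in for a key-extractor-style mapper that always
    // returns the same mutable object instead of allocating a new one.
    static class ReusingMapper {
        private final int[] reused = new int[1];
        int[] map(int value) {
            reused[0] = value * 10; // overwrite the shared object
            return reused;          // same instance on every call
        }
    }

    public static void main(String[] args) {
        ReusingMapper mapper = new ReusingMapper();

        // Object-reuse mode: the returned references are stored directly,
        // so every list entry is the SAME object holding the last value.
        List<int[]> reused = new ArrayList<>();
        for (int i = 1; i <= 3; i++) {
            reused.add(mapper.map(i));
        }
        System.out.println("reused: " + reused.get(0)[0] + " "
                + reused.get(1)[0] + " " + reused.get(2)[0]);

        // Defensive copies before storing keep each element distinct.
        List<int[]> copied = new ArrayList<>();
        for (int i = 1; i <= 3; i++) {
            copied.add(mapper.map(i).clone());
        }
        System.out.println("copied: " + copied.get(0)[0] + " "
                + copied.get(1)[0] + " " + copied.get(2)[0]);
    }
}
```

The first list prints the last mapped value three times, which is exactly the symptom described above.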
Re: [DISCUSS] Dedicated streaming mode and start scripts
Today we had a discussion with Robert on this issue. I would like to eventually have the streaming grouping and windowing buffers/state, maybe along with crucial user state, in the managed memory. If we had this, separating the two modes could become less important, as streaming would also use this space. I do not propose to change the above decision for the current needs, this is just a heads-up. On Wed, Feb 18, 2015 at 1:11 PM, Paris Carbone par...@kth.se wrote: +1 I agree it’s a proper way to go. On 18 Feb 2015, at 10:41, Max Michels m...@apache.org wrote: +1 On Tue, Feb 17, 2015 at 2:40 PM, Aljoscha Krettek aljos...@apache.org wrote: +1 On Tue, Feb 17, 2015 at 1:34 PM, Till Rohrmann trohrm...@apache.org wrote: +1 On Tue, Feb 17, 2015 at 1:34 PM, Kostas Tzoumas ktzou...@apache.org wrote: +1 On Tue, Feb 17, 2015 at 12:14 PM, Márton Balassi mbala...@apache.org wrote: When it comes to the current use cases I'm for this separation. @Ufuk: As Gyula has already pointed out, with the current design of integration it should not be a problem. Even if we submitted programs to the wrong clusters it would only cause performance issues. Eventually it would be nice to have an integrated cluster. On Tue, Feb 17, 2015 at 11:55 AM, Ufuk Celebi u...@apache.org wrote: I think this separation reflects the way that Flink is used currently anyways. I would be in favor of it as well. - What about the ongoing efforts (I think by Gyula) to combine both the batch and stream processing APIs? I assume that this would only affect the performance and wouldn't pose a fundamental problem there, would it?
Re: Tweets Custom Input Format
@Robert, I have created the PR https://github.com/apache/flink/pull/442. On Fri, Feb 27, 2015 at 11:58 AM, Mustafa Elbehery elbeherymust...@gmail.com wrote: @Robert, Thanks, I was asking about the procedure. I have opened a Jira ticket for Flink-Contrib and I will create a PR with the naming convention on the Wiki, https://issues.apache.org/jira/browse/FLINK-1615. On Fri, Feb 27, 2015 at 11:55 AM, Robert Metzger rmetz...@apache.org wrote: I'm glad you've found the how-to-contribute guide. I cannot describe the process of opening a pull request better than it is already written in the guide. Maybe this link is also helpful for you: https://help.github.com/articles/creating-a-pull-request/ Are you facing a particular error message? Maybe that helps me to help you better. On Fri, Feb 27, 2015 at 10:46 AM, Mustafa Elbehery elbeherymust...@gmail.com wrote: Actually I am reading "How to contribute" now to push the code. It's working and tested locally and on the cluster, and I have used it for an ETL. The structure is as follows: Java Pojos for the tweet object and the nested objects, a parser class using an event-driven approach, and the SimpleTweetInputFormat itself. Would you guide me on how to push the code, just to save some time :) On Fri, Feb 27, 2015 at 10:42 AM, Robert Metzger rmetz...@apache.org wrote: Hi, cool! Can you generalize the input format to read JSON into an arbitrary POJO? It would be great if you could contribute the InputFormat to the flink-contrib module. I've seen many users reading JSON data with Flink, so it's good to have a standard solution for that. If you want, you can add the Tweet POJO as an example in flink-contrib. On Fri, Feb 27, 2015 at 10:37 AM, Mustafa Elbehery elbeherymust...@gmail.com wrote: Hi, I am really sorry for being so late, it was a whole month of projects and examinations, I was really busy. @Robert, it is an IF for reading tweets into a Pojo.
I use an event-driven parser and retrieve most of the tweet into Java Pojos; it was tested on a 1TB dataset for a Flink ETL job, and the performance was pretty good. On Sun, Jan 25, 2015 at 7:38 PM, Robert Metzger rmetz...@apache.org wrote: Hey, is it an input format for reading JSON data or an IF for reading tweets in some format into a pojo? I think a JSON input format would be something very useful for our users. Maybe you can add that and use the Tweet IF as a concrete example for it? Do you have a preview of the code somewhere? Best, Robert On Sat, Jan 24, 2015 at 11:06 AM, Fabian Hueske fhue...@gmail.com wrote: Hi Mustafa, that would be a nice contribution! We are currently discussing how to add non-core API features into Flink [1]. I will move this discussion onto the mailing list to decide where to add cool add-ons like yours. Cheers, Fabian [1] https://issues.apache.org/jira/browse/FLINK-1398 2015-01-23 20:42 GMT+01:00 Henry Saputra henry.sapu...@gmail.com: Contributions are welcome! Here is the link on how to contribute to Apache Flink: http://flink.apache.org/how-to-contribute.html You can start by creating a JIRA ticket [1] to help describe what you want to do and to get feedback from the community. - Henry [1] https://issues.apache.org/jira/secure/Dashboard.jspa On Fri, Jan 23, 2015 at 10:54 AM, Mustafa Elbehery elbeherymust...@gmail.com wrote: Hi, I have created a custom InputFormat for tweets on Flink, based on the JSON-Simple event-driven parser. I would like to contribute my work to Flink. Regards.
-- Mustafa Elbehery EIT ICT Labs Master School http://www.masterschool.eitictlabs.eu/home/ +49(0)15218676094 skype: mustafaelbehery87
Re: Thoughts About Object Reuse and Collection Execution
I vote to have the key extractor return a new value each time. That means that objects are not reused everywhere it is possible, but still in most places, which still helps. What still puzzles me: I thought that the collection execution stores copies of the returned records by default (reuse-safe mode). On 27.02.2015 15:36, Aljoscha Krettek aljos...@apache.org wrote: Hello Nation of Flink, while figuring out this bug: https://issues.apache.org/jira/browse/FLINK-1569 I came upon some difficulties. The problem is that the KeyExtractorMappers always return the same tuple. This is problematic, since collection execution simply stores the returned values in a list. These elements are not copied before they are stored when object reuse is enabled. Therefore, the whole list will contain only that one reused element. I see two options for solving this: 1. Change the KeyExtractorMappers to always return a new tuple, thereby making object-reuse mode in cluster execution useless for key extractors. 2. Change the collection execution mapper to always make copies of the returned elements. This would make object reuse in collection execution pretty much obsolete, IMHO. How should we proceed with this? Cheers, Aljoscha
Flink Streaming parallelism bug report
As far as I know, the time of creation of the execution environment has been slightly modified in the streaming API, which can cause dataStream.getParallelism() and dataStream.env.getDegreeOfParallelism() to return different values. Usage of the former is recommended. In theory, the latter has been eliminated from the code, but there might be some occurrences left hiding. I've recently fixed one in WindowedDataStream. If you encounter problems with the parallelism, this may be the cause. Peter
Re: Flink Streaming parallelism bug report
They should actually return different values in many cases. DataStream.env.getDegreeOfParallelism() returns the environment parallelism (the default), while DataStream.getParallelism() returns the parallelism of the operator. There is a reason why one or the other is used. Please watch out when you try to modify that, because you might actually break functionality there :p On Feb 27, 2015 8:55 AM, Szabó Péter nemderogator...@gmail.com wrote: As far as I know, the time of creation of the execution environment has been slightly modified in the streaming API, which can cause dataStream.getParallelism() and dataStream.env.getDegreeOfParallelism() to return different values. Usage of the former is recommended. In theory, the latter has been eliminated from the code, but there might be some occurrences left hiding. I've recently fixed one in WindowedDataStream. If you encounter problems with the parallelism, this may be the cause. Peter
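The distinction between the two calls can be sketched with a toy model. This is not Flink's actual implementation, just an illustration of the pattern: the environment holds a default, and an operator falls back to it when no operator-level parallelism was set (with -1 acting as the "not set" sentinel, which also matches the -1 seen in the test failure discussed in this thread).

```java
public class ParallelismDemo {
    // Illustrative model of an execution environment with a default parallelism.
    static class Environment {
        private final int defaultParallelism;
        Environment(int p) { this.defaultParallelism = p; }
        int getDegreeOfParallelism() { return defaultParallelism; }
    }

    // Illustrative model of an operator: -1 means "use the environment default".
    static class Operator {
        private final Environment env;
        private int parallelism = -1;
        Operator(Environment env) { this.env = env; }
        void setParallelism(int p) { this.parallelism = p; }
        int getParallelism() {
            return parallelism == -1 ? env.getDegreeOfParallelism() : parallelism;
        }
    }

    public static void main(String[] args) {
        Environment env = new Environment(4);
        Operator source = new Operator(env); // no override: uses the default
        Operator sink = new Operator(env);
        sink.setParallelism(1);              // operator-level override
        System.out.println("env=" + env.getDegreeOfParallelism()
                + " source=" + source.getParallelism()
                + " sink=" + sink.getParallelism());
    }
}
```

In this model the two getters agree only when no operator-level override exists, which is why code that consults the raw environment value where the operator value is meant can misbehave.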
Re: Contributing to Flink
Hi Niraj, Pleased to hear you want to start contributing to Flink :) In terms of security, there are some open issues. Like Robert mentioned, it would be great if you could implement proper HDFS Kerberos authentication. Basically, the HDFS delegation token needs to be transferred to the workers so that they don't have to be authenticated themselves. Also, proper renewal of the token needs to be taken care of. Another security task could be to implement authentication support in Flink. This could be done by asking the user for some kind of shared secret. Alternatively, authentication could also be performed by Kerberos' authentication method. Let me know what you find interesting. Best regards, Max On Fri, Feb 27, 2015 at 2:31 AM, Rai, Niraj niraj@intel.com wrote: Hi Robert, Thanks for the detailed response. I worked on the encryption of HDFS as well as the crypto file system in HDFS, so I am aware of how it is done in Hadoop. Let me sync up with Max to get started on it. I will also start looking into the current implementations. Niraj From: Robert Metzger [mailto:rmetz...@apache.org] Sent: Thursday, February 26, 2015 3:11 PM To: dev@flink.apache.org Cc: Rai, Niraj Subject: Re: Contributing to Flink Hi Niraj, Welcome to the Flink community ;) I'm really excited that you want to contribute to our project, and since you've asked for something in the security area, I actually have something very concrete in mind. We recently added some support for accessing (Kerberos) secured HDFS clusters in Flink: https://issues.apache.org/jira/browse/FLINK-1504. However, the implementation is very simple because it assumes that every worker of Flink (TaskManager) is authenticated with Kerberos (kinit). It's not very practical for large setups because you have to ssh to all machines to log into Kerberos. What I would really like to have in Flink would be a way to transfer the authentication tokens from the JobManager (master) to the TaskManagers.
This way, users only have to be authenticated with Kerberos at the JobManager, and Flink takes care of the rest. As far as I understood it, Hadoop already has all the utilities in place for getting and transferring the delegation tokens. Max Michels, another committer in our project, has quite a good understanding of the details there. It would be great if you (Max) could chime in if I forgot something. If you are interested in working on this, you can file a JIRA (https://issues.apache.org/jira/browse/FLINK) for tracking the progress and discussing the details. If not, I'm sure we'll come up with more interesting ideas. Robert On Thu, Feb 26, 2015 at 11:07 PM, Henry Saputra henry.sapu...@gmail.com wrote: Hi Niraj, Thanks for your interest in Apache Flink. The quickest way is to just give Flink a spin and figure out how it works. This would get you started on how it works before actually doing work on Flink =) Please do visit the Flink how-to-contribute page [1] and subscribe to the dev mailing list [2] to start following up. Welcome =) [1] http://flink.apache.org/how-to-contribute.html [2] http://flink.apache.org/community.html#mailing-lists On Thu, Feb 26, 2015 at 1:45 PM, Rai, Niraj niraj@intel.com wrote: Hi Flink Dev, I am looking to contribute to Flink, especially in the area of security. In the past, I have contributed to Pig, Hive and HDFS. I would really appreciate it if I could get some work assigned to me. Looking forward to hearing back from the development community of Flink. Thanks Niraj
[DISCUSS] URI NullPointerException in TestBaseUtils
The following code snippet is from TestBaseUtils:

protected static File asFile(String path) {
	try {
		URI uri = new URI(path);
		if (uri.getScheme().equals("file")) {
			return new File(uri.getPath());
		} else {
			throw new IllegalArgumentException("This path does not denote a local file.");
		}
	} catch (URISyntaxException e) {
		throw new IllegalArgumentException("This path does not describe a valid local file URI.");
	}
}

If the URI does not have a scheme (e.g. /home/something.txt), uri.getScheme() returns null, so uri.getScheme().equals("file") throws a NullPointerException instead of the intended IllegalArgumentException. I feel it would make more sense to catch the NullPointerException at the end as well. What do you guys think? Peter
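One possible fix (a sketch, not necessarily what was later committed to Flink): flip the comparison to "file".equals(uri.getScheme()), which is simply false for a null scheme, so a schemeless path falls through to the intended IllegalArgumentException instead of throwing a NullPointerException. The small demo below shows that a plain path parses to a URI with a null scheme and that the flipped check behaves as desired.

```java
import java.io.File;
import java.net.URI;
import java.net.URISyntaxException;

public class AsFileDemo {
    static File asFile(String path) {
        try {
            URI uri = new URI(path);
            // "file".equals(null) is false, so a schemeless path now takes
            // the IllegalArgumentException branch instead of throwing an NPE.
            if ("file".equals(uri.getScheme())) {
                return new File(uri.getPath());
            } else {
                throw new IllegalArgumentException("This path does not denote a local file.");
            }
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("This path does not describe a valid local file URI.");
        }
    }

    public static void main(String[] args) throws Exception {
        // A plain path parses fine but has no scheme component.
        System.out.println("scheme of plain path: " + new URI("/home/something.txt").getScheme());
        System.out.println("file URI -> " + asFile("file:///home/something.txt").getPath());
        try {
            asFile("/home/something.txt");
        } catch (IllegalArgumentException e) {
            System.out.println("plain path -> IllegalArgumentException");
        }
    }
}
```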
Re: Flink Streaming parallelism bug report
Okay, thanks! In my case, I tried to run an ITCase test, the environment parallelism happened to be -1, and an exception was thrown. The other ITCases ran properly, so I figured the problem is with the windowing. Can you check it out for me? (WindowedDataStream, line 348) Peter 2015-02-27 10:06 GMT+01:00 Gyula Fóra gyf...@apache.org: They should actually return different values in many cases. DataStream.env.getDegreeOfParallelism() returns the environment parallelism (the default), while DataStream.getParallelism() returns the parallelism of the operator. There is a reason why one or the other is used. Please watch out when you try to modify that, because you might actually break functionality there :p On Feb 27, 2015 8:55 AM, Szabó Péter nemderogator...@gmail.com wrote: As far as I know, the time of creation of the execution environment has been slightly modified in the streaming API, which can cause dataStream.getParallelism() and dataStream.env.getDegreeOfParallelism() to return different values. Usage of the former is recommended. In theory, the latter has been eliminated from the code, but there might be some occurrences left hiding. I've recently fixed one in WindowedDataStream. If you encounter problems with the parallelism, this may be the cause. Peter
Re: Drop support for CDH4 / Hadoop 2.0.0-alpha
@Henry: We would still shade Hadoop because of its Guava / ASM dependencies, which interfere with our dependencies. The nice thing about my change is that all the other Flink modules don't have to care about the details of our Hadoop dependency. It's basically an abstract Hadoop dependency, without Guava and ASM ;) @Stephan: The actual issue is not CDH4 itself but shading protobuf into Hadoop. If you think we should shade protobuf out of Hadoop, then we can keep the CDH4 support, and I have to figure out a way to get the flink-yarn-tests running. I'm not aware of a user who complained about our protobuf dependency. On Fri, Feb 27, 2015 at 1:23 AM, Henry Saputra henry.sapu...@gmail.com wrote: If we were to drop CDH4 / Hadoop 2.0.0-alpha, would this mean we do not even need to shade the Hadoop fat jars, or do we still need to support 1.x? - Henry On Thu, Feb 26, 2015 at 8:57 AM, Robert Metzger rmetz...@apache.org wrote: Hi, I'm currently working on https://issues.apache.org/jira/browse/FLINK-1605 and it's a hell of a mess. I got almost everything working, except for the hadoop-2.0.0-alpha profile. The profile exists because google protobuf has a different version in that Hadoop release. Since Maven sets the version of protobuf for the entire project to the older version, we have to use an older Akka version, which is causing issues. The logical conclusion from that would be shading Hadoop's protobuf version into the Hadoop jars. That by itself is working; however, it's not working for the flink-yarn-tests. I think I can also solve the issue with the flink-yarn-tests, but it would be a very dirty hack (either injecting shaded code into the failsafe test classpath or putting test code into src/main). But the general question remains: Are we willing to continue spending a lot of time on maintaining the profile?
Till has spent a lot of time recently fixing failing test cases for that old Akka version, I have spent almost two days now on getting the shading/dependencies right, and I'm sure we'll keep having trouble with the profile. Therefore, I was wondering if this is the right time to drop support for CDH4 / Hadoop 2.0.0-alpha. Best, Robert
Re: Tweets Custom Input Format
Hi, cool! Can you generalize the input format to read JSON into an arbitrary POJO? It would be great if you could contribute the InputFormat to the flink-contrib module. I've seen many users reading JSON data with Flink, so it's good to have a standard solution for that. If you want, you can add the Tweet POJO as an example in flink-contrib. On Fri, Feb 27, 2015 at 10:37 AM, Mustafa Elbehery elbeherymust...@gmail.com wrote: Hi, I am really sorry for being so late, it was a whole month of projects and examinations, I was really busy. @Robert, it is an IF for reading tweets into a Pojo. I use an event-driven parser and retrieve most of the tweet into Java Pojos; it was tested on a 1TB dataset for a Flink ETL job, and the performance was pretty good. On Sun, Jan 25, 2015 at 7:38 PM, Robert Metzger rmetz...@apache.org wrote: Hey, is it an input format for reading JSON data or an IF for reading tweets in some format into a pojo? I think a JSON input format would be something very useful for our users. Maybe you can add that and use the Tweet IF as a concrete example for it? Do you have a preview of the code somewhere? Best, Robert On Sat, Jan 24, 2015 at 11:06 AM, Fabian Hueske fhue...@gmail.com wrote: Hi Mustafa, that would be a nice contribution! We are currently discussing how to add non-core API features into Flink [1]. I will move this discussion onto the mailing list to decide where to add cool add-ons like yours. Cheers, Fabian [1] https://issues.apache.org/jira/browse/FLINK-1398 2015-01-23 20:42 GMT+01:00 Henry Saputra henry.sapu...@gmail.com: Contributions are welcome! Here is the link on how to contribute to Apache Flink: http://flink.apache.org/how-to-contribute.html You can start by creating a JIRA ticket [1] to help describe what you want to do and to get feedback from the community.
- Henry [1] https://issues.apache.org/jira/secure/Dashboard.jspa On Fri, Jan 23, 2015 at 10:54 AM, Mustafa Elbehery elbeherymust...@gmail.com wrote: Hi, I have created a custom InputFormat for tweets on Flink, based on the JSON-Simple event-driven parser. I would like to contribute my work to Flink. Regards. -- Mustafa Elbehery EIT ICT Labs Master School http://www.masterschool.eitictlabs.eu/home/ +49(0)15218676094 skype: mustafaelbehery87
Re: [DISCUSS] Iterative streaming example
Cool! At the moment I don't have any good use cases, but I will read some literature about it in the near future. The first priority for me is to make a good streaming iteration example, and Márton liked the machine-learning idea. That, and there is a group at SZTAKI that develops recommendation systems, and we'd like to cooperate in order to implement some of their algorithms in Flink Streaming. Peter 2015-02-26 23:30 GMT+01:00 Paris Carbone par...@kth.se: We haven't yet implemented any of these machine learning models directly on the Flink API, but we have run them through the existing Samoa tasks, using Flink Streaming as a backend. Apart from that, we have a student looking into machine learning pipelines on Flink Streaming with a focus on iterative jobs, so we will have many more use cases coming soon. Are you also considering looking into something similar? Perhaps I can help more if you have some specific use case in mind. Paris On 23 Feb 2015, at 14:29, Szabó Péter nemderogator...@gmail.com wrote: Nice. Thank you guys! @Paris Are there any Flink implementations of this model? The GitHub doc is quite general. Peter 2015-02-23 14:05 GMT+01:00 Paris Carbone par...@kth.se: Hello Peter, Streaming machine learning algorithms make use of iterations quite widely. One simple example is implementing distributed stream learners. There, in many cases you need some central model aggregator, distributed estimators to offload the central node, and of course feedback loops to merge everything back to the main aggregator periodically. One such example is the Vertical Hoeffding Tree Classifier (VFDT) [1] that is implemented in Samoa. Iterative streams are also useful for optimisation techniques, as in batch processing (e.g. trying different parameters to estimate a variable, getting back the accuracy from an evaluator, and repeating until a condition is achieved).
I hope this helps to give a general idea of where iterations can be used. [1] https://github.com/yahoo/samoa/wiki/Vertical-Hoeffding-Tree-Classifier On 23 Feb 2015, at 12:13, Stephan Ewen se...@apache.org wrote: I think that the Samoa people have quite a few nice examples along the lines of model training with feedback. @Paris: What would be the simplest example? On Mon, Feb 23, 2015 at 11:27 AM, Szabó Péter nemderogator...@gmail.com wrote: Does anyone know of a good, simple and realistic streaming iteration example? The current example tests a random generator, but it should be replaced by something deterministic in order to be testable. Peter
Re: Flink Streaming parallelism bug report
I can't look at it at the moment, I am on vacation and don't have my laptop. On Feb 27, 2015 9:41 AM, Szabó Péter nemderogator...@gmail.com wrote: Okay, thanks! In my case, I tried to run an ITCase test, the environment parallelism happened to be -1, and an exception was thrown. The other ITCases ran properly, so I figured the problem is with the windowing. Can you check it out for me? (WindowedDataStream, line 348) Peter 2015-02-27 10:06 GMT+01:00 Gyula Fóra gyf...@apache.org: They should actually return different values in many cases. DataStream.env.getDegreeOfParallelism() returns the environment parallelism (the default), while DataStream.getParallelism() returns the parallelism of the operator. There is a reason why one or the other is used. Please watch out when you try to modify that, because you might actually break functionality there :p On Feb 27, 2015 8:55 AM, Szabó Péter nemderogator...@gmail.com wrote: As far as I know, the time of creation of the execution environment has been slightly modified in the streaming API, which can cause dataStream.getParallelism() and dataStream.env.getDegreeOfParallelism() to return different values. Usage of the former is recommended. In theory, the latter has been eliminated from the code, but there might be some occurrences left hiding. I've recently fixed one in WindowedDataStream. If you encounter problems with the parallelism, this may be the cause. Peter
[jira] [Created] (FLINK-1615) Introduces a new InputFormat for Tweets
mustafa elbehery created FLINK-1615: --- Summary: Introduces a new InputFormat for Tweets Key: FLINK-1615 URL: https://issues.apache.org/jira/browse/FLINK-1615 Project: Flink Issue Type: New Feature Components: flink-contrib Affects Versions: 0.8.1 Reporter: mustafa elbehery Priority: Minor An event-driven parser for Tweets into Java Pojos. It parses all the important parts of the tweet into Java objects. Tested on a cluster, and the performance is pretty good. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Tweets Custom Input Format
I'm glad you've found the how-to-contribute guide. I cannot describe the process of opening a pull request better than it is already written in the guide. Maybe this link is also helpful for you: https://help.github.com/articles/creating-a-pull-request/ Are you facing a particular error message? Maybe that helps me to help you better. On Fri, Feb 27, 2015 at 10:46 AM, Mustafa Elbehery elbeherymust...@gmail.com wrote: Actually I am reading "How to contribute" now to push the code. It's working and tested locally and on the cluster, and I have used it for an ETL. The structure is as follows: Java Pojos for the tweet object and the nested objects, a parser class using an event-driven approach, and the SimpleTweetInputFormat itself. Would you guide me on how to push the code, just to save some time :) On Fri, Feb 27, 2015 at 10:42 AM, Robert Metzger rmetz...@apache.org wrote: Hi, cool! Can you generalize the input format to read JSON into an arbitrary POJO? It would be great if you could contribute the InputFormat to the flink-contrib module. I've seen many users reading JSON data with Flink, so it's good to have a standard solution for that. If you want, you can add the Tweet POJO as an example in flink-contrib. On Fri, Feb 27, 2015 at 10:37 AM, Mustafa Elbehery elbeherymust...@gmail.com wrote: Hi, I am really sorry for being so late, it was a whole month of projects and examinations, I was really busy. @Robert, it is an IF for reading tweets into a Pojo. I use an event-driven parser and retrieve most of the tweet into Java Pojos; it was tested on a 1TB dataset for a Flink ETL job, and the performance was pretty good. On Sun, Jan 25, 2015 at 7:38 PM, Robert Metzger rmetz...@apache.org wrote: Hey, is it an input format for reading JSON data or an IF for reading tweets in some format into a pojo? I think a JSON input format would be something very useful for our users. Maybe you can add that and use the Tweet IF as a concrete example for it? Do you have a preview of the code somewhere?
Best, Robert On Sat, Jan 24, 2015 at 11:06 AM, Fabian Hueske fhue...@gmail.com wrote: Hi Mustafa, that would be a nice contribution! We are currently discussing how to add non-core API features into Flink [1]. I will move this discussion onto the mailing list to decide where to add cool add-ons like yours. Cheers, Fabian [1] https://issues.apache.org/jira/browse/FLINK-1398 2015-01-23 20:42 GMT+01:00 Henry Saputra henry.sapu...@gmail.com: Contributions are welcome! Here is the link on how to contribute to Apache Flink: http://flink.apache.org/how-to-contribute.html You can start by creating a JIRA ticket [1] to help describe what you want to do and to get feedback from the community. - Henry [1] https://issues.apache.org/jira/secure/Dashboard.jspa On Fri, Jan 23, 2015 at 10:54 AM, Mustafa Elbehery elbeherymust...@gmail.com wrote: Hi, I have created a custom InputFormat for tweets on Flink, based on the JSON-Simple event-driven parser. I would like to contribute my work to Flink. Regards. -- Mustafa Elbehery EIT ICT Labs Master School http://www.masterschool.eitictlabs.eu/home/ +49(0)15218676094 skype: mustafaelbehery87
Re: Tweets Custom Input Format
@Robert, thanks. I was asking about the procedure. I have opened a JIRA ticket for flink-contrib, and I will create a PR following the naming convention on the wiki: https://issues.apache.org/jira/browse/FLINK-1615

On Fri, Feb 27, 2015 at 11:55 AM, Robert Metzger rmetz...@apache.org wrote: I'm glad you've found the how-to-contribute guide. I cannot describe the process of opening a pull request better than the guide already does. Maybe this link is also helpful for you: https://help.github.com/articles/creating-a-pull-request/ Are you facing a particular error message? That might help me help you better.

-- Mustafa Elbehery EIT ICT Labs Master School http://www.masterschool.eitictlabs.eu/home/ +49(0)15218676094 skype: mustafaelbehery87
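The generalization discussed in this thread — reading JSON into an arbitrary POJO rather than only tweets — can be illustrated with a deliberately tiny, stdlib-only sketch. Everything below is a hypothetical illustration: a real contribution would use a proper event-driven JSON parser (e.g. JSON-Simple, as Mustafa's format does) and Flink's InputFormat interfaces, neither of which is shown here. This toy handles only flat objects with string and integral fields:

```java
import java.lang.reflect.Field;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy sketch: map a *flat* JSON object into a POJO via reflection.
 *  Class and field names here are illustrative, not Flink's actual API. */
public class FlatJsonMapper {
    // Matches either "key":"string value" or "key":integer
    private static final Pattern FIELD =
        Pattern.compile("\"(\\w+)\"\\s*:\\s*(?:\"([^\"]*)\"|(-?\\d+))");

    public static <T> T map(String json, Class<T> type) throws Exception {
        T obj = type.getDeclaredConstructor().newInstance();
        Matcher m = FIELD.matcher(json);
        while (m.find()) {
            Field f;
            try {
                f = type.getDeclaredField(m.group(1));
            } catch (NoSuchFieldException e) {
                continue; // skip JSON keys the POJO does not declare
            }
            f.setAccessible(true);
            if (m.group(2) != null) {
                f.set(obj, m.group(2));                  // string value
            } else {
                f.set(obj, Long.parseLong(m.group(3)));  // integral value
            }
        }
        return obj;
    }

    /** Minimal stand-in for the Tweet POJO discussed in the thread. */
    public static class Tweet {
        public long id;
        public String text;
    }

    public static void main(String[] args) throws Exception {
        Tweet t = map("{\"id\":42,\"text\":\"hello flink\"}", Tweet.class);
        System.out.println(t.id + " " + t.text); // prints: 42 hello flink
    }
}
```

The reflection step is the part that generalizes to an arbitrary POJO: JSON keys are matched against the target class's declared fields, and unknown keys are skipped.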
Re: [DISCUSS] URI NullPointerException in TestBaseUtils
Yeah, I agree, it is at best a cosmetic issue. I just wanted to let you know about it. Peter

2015-02-27 11:10 GMT+01:00 Till Rohrmann trohrm...@apache.org: Catching the NullPointerException and throwing an IllegalArgumentException with a meaningful message might clarify things. Considering that it only affects TestBaseUtils, it should not be a big deal to change it.

On Fri, Feb 27, 2015 at 10:30 AM, Szabó Péter nemderogator...@gmail.com wrote: The following code snippet is from TestBaseUtils:

protected static File asFile(String path) {
	try {
		URI uri = new URI(path);
		if (uri.getScheme().equals("file")) {
			return new File(uri.getPath());
		} else {
			throw new IllegalArgumentException("This path does not denote a local file.");
		}
	} catch (URISyntaxException e) {
		throw new IllegalArgumentException("This path does not describe a valid local file URI.");
	}
}

If uri does not have a scheme (e.g. /home/something.txt), uri.getScheme().equals("file") throws a NullPointerException instead of an IllegalArgumentException being thrown. I feel it would make more sense to catch the NullPointerException at the end. What do you guys think? Peter
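Besides catching the NullPointerException as the thread proposes, an alternative sketch (an assumption on my part, not what the thread decided) sidesteps the NPE entirely with a null-safe comparison, since URI.getScheme() returns null for scheme-less paths:

```java
import java.io.File;
import java.net.URI;
import java.net.URISyntaxException;

public class PathUtil {
    /** Sketch of a null-safe variant: "file".equals(...) tolerates a null
     *  scheme, so /home/something.txt falls through to the
     *  IllegalArgumentException instead of throwing an NPE. */
    static File asFile(String path) {
        try {
            URI uri = new URI(path);
            if ("file".equals(uri.getScheme())) { // null-safe: scheme may be null
                return new File(uri.getPath());
            }
            throw new IllegalArgumentException("This path does not denote a local file.");
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("This path does not describe a valid local file URI.");
        }
    }
}
```

Either way, the caller now sees an IllegalArgumentException with a meaningful message for both malformed URIs and non-file paths.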
Re: Flink Streaming parallelism bug report
No problem. I will not commit the modification until it is clarified. Peter

2015-02-27 10:48 GMT+01:00 Gyula Fóra gyf...@apache.org: I can't look at it at the moment; I am on vacation and don't have my laptop.

On Feb 27, 2015 9:41 AM, Szabó Péter nemderogator...@gmail.com wrote: Okay, thanks! In my case, I tried to run an ITCase test, the environment parallelism happened to be -1, and an exception was thrown. The other ITCases ran properly, so I figured the problem is with the windowing. Can you check it out for me? (WindowedDataStream, line 348) Peter

2015-02-27 10:06 GMT+01:00 Gyula Fóra gyf...@apache.org: They should actually return different values in many cases. dataStream.env.getDegreeOfParallelism() returns the environment (default) parallelism, while dataStream.getParallelism() returns the parallelism of the operator. There is a reason why one or the other is used. Please watch out when you modify that, because you might actually break functionality there :p

On Feb 27, 2015 8:55 AM, Szabó Péter nemderogator...@gmail.com wrote: As far as I know, the time of creation of the execution environment has been slightly changed in the streaming API, which means that dataStream.getParallelism() and dataStream.env.getDegreeOfParallelism() may return different values. Usage of the former is recommended. In theory, the latter has been eliminated from the code, but there might be some more occurrences left, hiding. I've recently fixed one in WindowedDataStream. If you encounter problems with the parallelism, this may be the cause. Peter
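Gyula's distinction between environment and operator parallelism can be sketched with a toy model. The class and field names below are illustrative only, not Flink's actual API; the point is that an operator's parallelism stays at a -1 sentinel until explicitly set, and must be resolved against the environment default rather than read raw:

```java
/** Toy model (NOT Flink's real classes) of operator vs. environment parallelism. */
public class ParallelismDemo {
    static final int INHERIT = -1; // sentinel: "not explicitly configured"

    public static class Environment {
        public int defaultParallelism;
        public Environment(int p) { this.defaultParallelism = p; }
    }

    public static class Operator {
        public int parallelism = INHERIT; // stays -1 until setParallelism-style call

        /** Safe lookup: fall back to the environment default when unset. */
        public int resolvedParallelism(Environment env) {
            return parallelism == INHERIT ? env.defaultParallelism : parallelism;
        }
    }

    public static void main(String[] args) {
        Environment env = new Environment(4);
        Operator windowOp = new Operator(); // never explicitly configured
        // Reading the raw field yields -1, which an "invalid parallelism"
        // check would reject; resolving yields a usable value.
        System.out.println(windowOp.parallelism);              // prints: -1
        System.out.println(windowOp.resolvedParallelism(env)); // prints: 4
    }
}
```

In this model, code that reads the raw field of a never-configured operator sees -1, consistent with the exception Peter describes when the environment parallelism was still unset at operator-construction time.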
Re: [DISCUSS] URI NullPointerException in TestBaseUtils
Catching the NullPointerException and throwing an IllegalArgumentException with a meaningful message might clarify things. Considering that it only affects TestBaseUtils, it should not be a big deal to change it.

On Fri, Feb 27, 2015 at 10:30 AM, Szabó Péter nemderogator...@gmail.com wrote: The following code snippet is from TestBaseUtils:

protected static File asFile(String path) {
	try {
		URI uri = new URI(path);
		if (uri.getScheme().equals("file")) {
			return new File(uri.getPath());
		} else {
			throw new IllegalArgumentException("This path does not denote a local file.");
		}
	} catch (URISyntaxException e) {
		throw new IllegalArgumentException("This path does not describe a valid local file URI.");
	}
}

If uri does not have a scheme (e.g. /home/something.txt), uri.getScheme().equals("file") throws a NullPointerException instead of an IllegalArgumentException being thrown. I feel it would make more sense to catch the NullPointerException at the end. What do you guys think? Peter
[jira] [Created] (FLINK-1614) JM Webfrontend doesn't always show the correct state of Tasks
Robert Metzger created FLINK-1614:
Summary: JM Webfrontend doesn't always show the correct state of Tasks
Key: FLINK-1614
URL: https://issues.apache.org/jira/browse/FLINK-1614
Project: Flink
Issue Type: Bug
Components: JobManager, Webfrontend
Affects Versions: 0.9
Reporter: Robert Metzger

There is at least one reported case where the web interface shows that a data source is still running while the sink is already in the finished state. The log file indicates that the web interface is not correctly reporting the execution states of tasks, because according to the log the tasks finish correctly.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Could not build up connection to JobManager
It depends on how you started Flink. If you started a local cluster, then the TaskManager log is contained in the JobManager log; we just don't see the respective log output in the snippet you posted. If you started a TaskManager independently, either via taskmanager.sh or via start-cluster.sh, then a file with the name format flink-user-taskmanager-hostname.log should be created in flink/log/. If the Flink directory is not shared by your cluster nodes, then you have to look on the machine on which you started the TaskManager. But since the JobManager binds to 127.0.0.1, I guess that you started a local cluster. Check whether you find some logging statements from the logger org.apache.flink.runtime.taskmanager.TaskManager in your log. Maybe you can upload the corresponding log file to [1] and post a link here. Greets, Till [1] https://gist.github.com/

On Thu, Feb 26, 2015 at 6:45 PM, Dulaj Viduranga vidura...@icloud.com wrote: Hi, can you tell me where I can find the TaskManager logs? I can't find them in the log folder. I don't suppose I should run taskmanager.sh as well, right? I'm using OS X Yosemite. I'll send you my ifconfig.
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
	options=3<RXCSUM,TXCSUM>
	inet6 ::1 prefixlen 128
	inet 127.0.0.1 netmask 0xff00
	inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
	nd6 options=1<PERFORMNUD>
gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
stf0: flags=0<> mtu 1280
en0: flags=8823<UP,BROADCAST,SMART,SIMPLEX,MULTICAST> mtu 1500
	ether 60:03:08:a1:e0:f4
	nd6 options=1<PERFORMNUD>
	media: autoselect (unknown type)
	status: inactive
en1: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
	options=60<TSO4,TSO6>
	ether 72:00:02:32:14:d0
	media: autoselect <full-duplex>
	status: inactive
en2: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
	options=60<TSO4,TSO6>
	ether 72:00:02:32:14:d1
	media: autoselect <full-duplex>
	status: inactive
bridge0: flags=8822<BROADCAST,SMART,SIMPLEX,MULTICAST> mtu 1500
	options=63<RXCSUM,TXCSUM,TSO4,TSO6>
	ether 62:03:08:1a:fa:00
	Configuration:
		id 0:0:0:0:0:0 priority 0 hellotime 0 fwddelay 0
		maxage 0 holdcnt 0 proto stp maxaddr 100 timeout 1200
		root id 0:0:0:0:0:0 priority 0 ifcost 0 port 0
		ipfilter disabled flags 0x2
	member: en1 flags=3<LEARNING,DISCOVER>
		ifmaxaddr 0 port 5 priority 0 path cost 0
	member: en2 flags=3<LEARNING,DISCOVER>
		ifmaxaddr 0 port 6 priority 0 path cost 0
	media: <unknown type>
	status: inactive
p2p0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> mtu 2304
	ether 02:03:08:a1:e0:f4
	media: autoselect
	status: inactive
awdl0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> mtu 1452
	ether 06:56:3d:f6:60:08
	nd6 options=1<PERFORMNUD>
	media: autoselect
	status: inactive
ppp0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1500
	inet 10.218.98.228 --> 10.64.64.64 netmask 0xff00
utun0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1380
	inet6 fe80::b0d4:d4be:7e62:e730%utun0 prefixlen 64 scopeid 0xb
	inet6 fdd0:b291:7da7:9153:b0d4:d4be:7e62:e730 prefixlen 64
	nd6 options=1<PERFORMNUD>

On Feb 26, 2015, at 10:48 PM, Stephan Ewen se...@apache.org wrote: Hi Dulaj! Thanks for helping to debug.
My guess is that you are now seeing the same thing between JobManager and TaskManager as you saw before between JobManager and JobClient. I have a patch pending that should help with the issue (see https://issues.apache.org/jira/browse/FLINK-1608); let's see if that solves it. What seems not right is that the JobManager initially accepted the TaskManager and the communication failed only later. Can you paste the TaskManager log as well? Also: there must be something fairly unique about your network configuration, as it works on all other setups that we use (locally, cloud, test servers, YARN, ...). Can you paste your ipconfig / ifconfig by any chance? Greetings, Stephan

On Thu, Feb 26, 2015 at 4:33 PM, Dulaj Viduranga vidura...@icloud.com wrote: Hi, it's great to help out. :) Setting 127.0.0.1 instead of "localhost" in jobmanager.rpc.address helped to build the connection to the JobManager. Apparently, localhost is resolved differently by the webclient and the jobmanager. I think it would be good to set jobmanager.rpc.address: 127.0.0.1 in future builds. But then I get this error when I try to run the examples. I don't know if I should move this issue to another thread; if so, please tell me. bin/flink run
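The workaround Dulaj describes amounts to pinning the JobManager address in conf/flink-conf.yaml. A minimal fragment, assuming the standard Flink config layout (the port shown is Flink's default RPC port):

```yaml
# conf/flink-conf.yaml -- pin the JobManager to the loopback IP instead of "localhost"
jobmanager.rpc.address: 127.0.0.1
jobmanager.rpc.port: 6123   # Flink's default JobManager RPC port
```

Whether 127.0.0.1 is a sensible default is exactly what the thread leaves open: on a multi-node cluster this value must be an address reachable by all TaskManagers, so the loopback IP only works for purely local setups.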