Re: Discussion about a Flink DataSource repository

2016-05-04 Thread Fabian Hueske
Hi Flavio, I thought a bit about your proposal. I am not sure if it is actually necessary to integrate a central source repository into Flink. It should be possible to offer this as an external service which is based on the recently added TableSource interface. TableSources could be extended to

Re: Creating a custom operator

2016-05-03 Thread Fabian Hueske
orrect and > custom runtime operators are not supposed to be implemented outside of > Flink. > > Thanks, > > Simone > > 2016-04-29 21:32 GMT+02:00 Fabian Hueske <fhue...@gmail.com>: > >> Hi Simone, >> >> the GraphCreatingVisitor transforms the com

Re: Sink - writeAsText problem

2016-05-03 Thread Fabian Hueske
Yes, but be aware that your program runs with parallelism 1 if you do not configure the parallelism. 2016-05-03 11:07 GMT+02:00 Punit Naik : > Hi Stephen, Fabian > > setting "fs.output.always-create-directory" to true in flink-config.yml > worked! > > On Tue, May 3, 2016

Re: Sink - writeAsText problem

2016-05-03 Thread Fabian Hueske
Did you specify a parallelism? The default parallelism of a Flink instance is 1 [1]. You can set a different default parallelism in ./conf/flink-conf.yaml or pass a job specific parallelism with ./bin/flink using the -p flag [2]. More options to define parallelism are in the docs [3]. [1]

Re: S3 Checkpoint Storage

2016-05-02 Thread Fabian Hueske
Hi John, S3 keys are configured via Hadoop's configuration files. Check out the documentation for AWS setups [1]. Cheers, Fabian [1] https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/aws.html 2016-05-02 20:22 GMT+02:00 John Sherwood : > Hello all, > > I'm

Re: Perform a groupBy on an already groupedDataset

2016-05-02 Thread Fabian Hueske
Grouping a grouped dataset is not supported. You can group on multiple keys: dataSet.groupBy(1,2). Can you describe your use case if that does not solve the problem? 2016-05-02 10:34 GMT+02:00 Punit Naik : > Hello > > I wanted to perform a groupBy on an already grouped

Re: Any way for Flink Elasticsearch connector reflecting IP change of Elasticsearch cluster?

2016-05-02 Thread Fabian Hueske
Yes, it looks like the connector only creates the connection once when it starts and fails if the host is no longer reachable. It should be possible to catch that failure and try to re-open the connection. I opened a JIRA for this issue (FLINK-3857). Would you like to implement the improvement?

Re: Unable to write stream as csv

2016-05-02 Thread Fabian Hueske
Have you checked the log files as well? 2016-05-01 14:07 GMT+02:00 subash basnet : > Hello there, > > If anyone could help me know why the below *result* DataStream get's > written as text, but not as csv?. As it's in a tuple format I guess it > should be the same for both

Re: Count of Grouped DataSet

2016-05-02 Thread Fabian Hueske
Hi Nirmalya, the solution with List.size() won't use a combiner and won't be efficient for large data sets with large groups. I would recommend to add a 1 and use GroupedDataSet.sum(). 2016-05-01 12:48 GMT+02:00 nsengupta : > Hello all, > > This is how I have moved

Re: EMR vCores and slot allocation

2016-05-02 Thread Fabian Hueske
The slot configuration should depend on the complexity of jobs. Since each slot runs a "slice" of a program, one slot might potentially execute many concurrent tasks. For complex jobs you should allocate more than one core for each slot. 2016-05-02 10:12 GMT+02:00 Robert Metzger

Re: Requesting the next InputSplit failed

2016-04-28 Thread Fabian Hueske
I was going to increase the Job Manager heap to 3 GB and maybe change some > gc setting. > Do you think I should increase also the akka timeout or other things? > > On Thu, Apr 28, 2016 at 2:06 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Hmm, 113k splits

Re: General Data questions - streams vs batch

2016-04-28 Thread Fabian Hueske
This is 21),(23,This is 23),(25,This is 25),(27,This is > 27),(29,This is 29) > ! 36: (20,This is 20),(22,This is 22),(24,This is 24),(26,This is > 26),(28,This is 28) > > > And if you can give a bit more info on why will I have latency issues in a > case of varying rate of arri

Re: Requesting the next InputSplit failed

2016-04-28 Thread Fabian Hueske
sted the wrong TM heap size...that is indeed 3Gb ( > taskmanager.heap.mb:512) > > Best, > Flavio > > On Thu, Apr 28, 2016 at 1:29 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Is the problem reproducible? >> Maybe the SplitAssigner gets stuck somehow, but I've n

Re: Requesting the next InputSplit failed

2016-04-28 Thread Fabian Hueske
gt; Do you see anything sospicious? > > Thanks for the support, > Flavio > > On Thu, Apr 28, 2016 at 11:54 AM, Fabian Hueske <fhue...@gmail.com> wrote: > >> I checked the input format from your PR, but didn't see anything >> suspicious. >> >> It is def

Re: General Data questions - streams vs batch

2016-04-28 Thread Fabian Hueske
Hi Konstantin, if you do not need a deterministic grouping of elements you should not use a keyed stream or window. Instead you can do the lookups in a parallel flatMap function. The function would collect arriving elements and perform a lookup query after a certain number of elements arrived

Re: Configuring a RichFunction on a DataStream

2016-04-28 Thread Fabian Hueske
Hi Robert, Function configuration via a Configuration object and the open method is an artifact from the past. The recommended way is to configure the function object via the constructor. Flink serializes the function object and ships them to the workers for execution. So the state of a function

Re: Requesting the next InputSplit failed

2016-04-28 Thread Fabian Hueske
I checked the input format from your PR, but didn't see anything suspicious. It is definitely OK if the processing of an input split tasks more than 10 seconds. That should not be the cause. It rather looks like the DataSourceTask fails to request a new split from the JobManager. 2016-04-28 9:37

Re: About flink stream table API

2016-04-28 Thread Fabian Hueske
Hi, Table API and SQL for streaming are work in progress. A first version which supports projection, filter, and union is merged to the master branch. Under the hood, Flink uses Calcite to optimize and translate Table API and SQL queries. Best, Fabian 2016-04-27 14:27 GMT+02:00 Zhangrucong

Re: Job hangs

2016-04-27 Thread Fabian Hueske
Hi Timur, I had a look at the plan you shared. I could not find any flow that branches and merges again, a pattern which is prone to cause a deadlocks. However, I noticed that the plan performs a lot of partitioning steps. You might want to have a look at forwarded field annotations which can

Re: Tuning parallelism in cascading-flink planner

2016-04-27 Thread Fabian Hueske
Hi Ken, at the moment, there are just two parameters to control the parallelism of Flink operators generated by the Cascading-Flink connector. The parameters are: - flink.num.sourceTasks to specify the parallelism of source tasks. - flink.num.shuffleTasks to specify the parallelism of all

Re: Return unique counter using groupReduceFunction

2016-04-26 Thread Fabian Hueske
Hi Biplob, Flink is a distributed, data parallel system which means that there are several instances of you ReduceFunction running in parallel, each with its own timestamp counter. If you want to have a unique timestamp, you have to set the parallelism of the reduce operator to 1, but then the

Re: Flink first() operator

2016-04-26 Thread Fabian Hueske
Actually, memory should not be a problem since the full data set would not be materialized in memory. Flink has a streaming runtime so most of the data would be immediately filtered out. However, reading the whole file causes of course a lot of unnecessary IO. 2016-04-26 17:09 GMT+02:00 Biplob

Re: Access to a shared resource within a mapper

2016-04-25 Thread Fabian Hueske
Hi Timur, a TaskManager may run as many subtasks of a Map operator as it has slots. Each subtask of an operator runs in a different thread. Each parallel subtask of a Map operator has its own MapFunction object, so it should be possible to use a lazy val. However, you should not use static

Re: Flink first() operator

2016-04-25 Thread Fabian Hueske
Hi Biplop, you can also implement a generic IF that wraps another IF (such as a CsvInputFormat). The wrapping IF forwards all calls to the wrapped IF and in addition counts how many records were emitted (how often InputFormat.nextRecord() was called). Once the count arrives at the threshold, it

Re: java.lang.ClassCastException: org.apache.flink.streaming.api.watermark.Watermark cannot be cast to org.apache.flink.streaming.runtime.streamrecord.StreamRecord

2016-04-22 Thread Fabian Hueske
Hi Konstantin, this exception is thrown if you do not set the time characteristic to event time and assign timestamps. Please try to add > env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) after you obtained the StreamExecutionEnvironment. Best, Fabian 2016-04-22 15:47 GMT+02:00

Re: implementing a continuous time window

2016-04-22 Thread Fabian Hueske
to make sure that your operator can recover from failures. Cheers, Fabian 2016-04-21 23:16 GMT+02:00 Jonathan Yom-Tov <jon.yom...@gmail.com>: > Thanks. Any pointers on how to do that? Or code examples which do similar > things? > > On Thu, Apr 21, 2016 at 10:30 PM, F

Re: implementing a continuous time window

2016-04-21 Thread Fabian Hueske
Yes, sliding windows are different. You want to evaluate the window whenever a new element arrives or an element leaves because 5 secs passed since it entered the window, right? I think that should be possible with a GlobalWindow, a custom Trigger which holds state about the time when each

Re: Explanation on limitations of the Flink Table API

2016-04-21 Thread Fabian Hueske
Hi Simone, in Flink 1.0.x, the Table API does not support reading external data, i.e., it is not possible to read a CSV file directly from the Table API. Tables can only be created from DataSet or DataStream which means that the data is already converted into "Flink types". However, the Table

Re: Set up Flink Cluster on Windows machines

2016-04-20 Thread Fabian Hueske
Hi Yifei, I think this has not been done before. At least I am not aware of anybody running Flink in cluster mode on Windows. In principle this should work. It is possible to start a local instance on Windows (start-local.bat) and to locally execute Flink programs on this instance using the

Re: Sink Parallelism

2016-04-20 Thread Fabian Hueske
nbounded all elements that arrive in the first window > are grouped/partitioned by keys and aggregated and so on until no more > streams left. The global result then has the aggregated key/value pairs. > > Kind Regards, > Ravinder Kaur > > > > On Wed, Apr 20, 2016 at 12:12 PM,

Re: Operation of Windows and Triggers

2016-04-20 Thread Fabian Hueske
Hi Piyush, if you explicitly set a trigger, the default trigger of the window is replaced. In your example, the time trigger is replaced by the count trigger, i.e., the window is only evaluated after the 100th element was received. This blog post discusses windows and triggers [1]. Best, Fabian

Re: Sink Parallelism

2016-04-20 Thread Fabian Hueske
Hi Ravinder, your drawing is pretty much correct (Flink will inject a combiner between flat map and reduce which locally combines records with the same key). The partitioning between flat map and reduce is done with hash partitioning by default. However, you can also define a custom partitioner

Re: Hash tables - joins, cogroup, deltaIteration

2016-04-19 Thread Fabian Hueske
Hi Ovidiu, Hash tables are currently used for joins (inner & outer) and the solution set of delta iterations. There is a pending PR that implements a hash table for partial aggregations (combiner) [1] which should be added soon. Joins (inner & outer) are already implemented as Hybrid Hash joins

Re: Powered by Flink

2016-04-12 Thread Fabian Hueske
t; Regards, >>>>> Robert >>>>> >>>>> [1] >>>>> https://cwiki.apache.org/confluence/display/FLINK/Powered+by+Flink >>>>> >>>>> >>>>> On Mon, Oct 19, 2015 at 4:10 PM, Matthias J. Sax <mj...@apache.org >>>>>

Re: Find differences

2016-04-07 Thread Fabian Hueske
I would go with an outer join as Stefano suggested. Outer joins can be executed as hash joins which will probably be more efficient than using a sort based groupBy/reduceGroup. Also outer joins are a more intuitive and simpler, IMO. 2016-04-07 12:35 GMT+02:00 Stefano Baghino

AW: Window Support in Flink

2016-03-28 Thread Fabian Hueske
Hopping windows is a term used on the Apache Calcite website [1]. In Flink terms, hopping windows are sliding windows. Cheers, Fabian [1] http://calcite.apache.org/docs/stream.html Von: Ufuk Celebi Gesendet: Montag, 28. März 2016 12:40 An: user@flink.apache.org Betreff: Re: Window Support in

Re: Flink 1.0.0 reading files from multiple directory with wildcards

2016-03-22 Thread Fabian Hueske
r...@teamaol.com> wrote: > >> Fabian, >> >> I'll try extending InputFormat as you suggested and will create a JIRA >> issue as well. >> >> I also have an AvroGenericRecordInput format class that I would like to >> contribute once I have time to clean i

Re: Flink 1.0.0 reading files from multiple directory with wildcards

2016-03-21 Thread Fabian Hueske
Hi, no, this is currently not supported. However, I agree this would be a very valuable addition to the FileInputFormat. Would you mind opening a JIRA issue with your suggestions? Until this is added to Flink, it can be implemented as a custom InputFormat based on FileInputFormat by overriding

Re: Setting taskmanager.network.numberOfBuffers does not seem to have an affect - Flink 0.10.2

2016-03-21 Thread Fabian Hueske
Hi, right now there is no way to sequentially execute the input tasks. Flink's FileInputFormat does also not support multiple paths out of the box. However, it is certainly possible to extend the FileInputFormat such that this is possible. You would need to override / extend the

Re: degree of Parallelism

2016-03-19 Thread Fabian Hueske
Hi, did find the documentation for configuring the parallelism [1]? It explains how to set the parallelism on different levels: Cluster, Job, Task. Best, Fabian [1] https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/common/index.html#parallel-execution 2016-03-18 13:34 GMT+01:00

Re: off-heap size feature request

2016-03-19 Thread Fabian Hueske
eInBytes), along with OS cache. > > I would like parameters like: > taskmanager.off-heap.size or taskmanager.off-heap.fraction > taskmanager.off-heap.enabled true or false > and same for heap. > > Thanks for clarification. > > Best, > Ovidiu > > > On 16 Mar 2016, at 13:43,

Re: what happens to the aggregate in a fold on a KeyedStream if the events for a key stop coming ?

2016-03-19 Thread Fabian Hueske
Hi Bart, if you run a fold function on a keyed stream without a window, there is no way to remove the key and the folded value. You will eventually run out of memory if your key space is continuously growing. If you apply a fold function in a window on a keyed stream you can bound the "lifetime"

Re: off-heap size feature request

2016-03-19 Thread Fabian Hueske
se-1.0/setup/config.html#managed-memory > > Best, > Ovidiu > > On 16 Mar 2016, at 12:13, Fabian Hueske <fhue...@gmail.com> wrote: > > Hi Ovidiu, > > the parameters to configure the amount of managed memory > (taskmanager.memory.size, > taskmanager.memory.

Re: what happens to the aggregate in a fold on a KeyedStream if the events for a key stop coming ?

2016-03-18 Thread Fabian Hueske
- > Bart van Deenen > bartvandee...@fastmail.fm > > > > On Fri, Mar 18, 2016, at 11:54, Fabian Hueske wrote: > > Hi Bart, > if you run a fold function on a keyed stream without a window, there is no > way to remove the key and the folded value. > You will eve

Re: Input on training exercises

2016-03-18 Thread Fabian Hueske
Hi Ken, you can open an issue on the Github repository or send a mail to me. Thanks, Fabian 2016-03-17 23:07 GMT+01:00 Ken Krugler : > Hi list, > > What's the right way to provide input on the training exercises >

Re: off-heap size feature request

2016-03-16 Thread Fabian Hueske
Hi Ovidiu, the parameters to configure the amount of managed memory (taskmanager.memory.size, taskmanager.memory.fraction) are valid for on and off-heap memory. Have you tried these parameters and didn't they work as expected? Best, Fabian 2016-03-16 11:43 GMT+01:00 Ovidiu-Cristian MARCU <

Re: Using a POJO class wrapping an ArrayList

2016-03-16 Thread Fabian Hueske
Hi Mengqi, I did not completely understand your use case. If you would like to use a composite key (a key with multiple fields) there are two alternatives: - use a tuple as key type. This only works if all records have the same number of key fields. Tuple serialization and comparisons are very

Re: Memory ran out PageRank

2016-03-16 Thread Fabian Hueske
Hi Ovidiu, putting the CompactingHashTable aside, all data structures and algorithms that use managed memory can spill to disk if data exceeds memory capacity. It was a conscious choice to not let the CompactingHashTable spill. Once the solution set hash table is spilled, (parts of) the hash

Re: Accumulators checkpointed?

2016-03-15 Thread Fabian Hueske
Hi Zach, at the moment, accumulators are not checkpointed and reset if if a failed task is restarted. Best, Fabian 2016-03-15 17:27 GMT+01:00 Zach Cox : > Are accumulators stored in checkpoint state? If a job fails and restarts, > are all accumulator values lost, or are they

Re: Flink loading an S3 File out of order

2016-03-10 Thread Fabian Hueske
Hi Benjamin, Flink reads data usually in parallel. This is done by splitting the input (e.g., a file) into several input splits. Each input split is independently processed. Since splits are usually concurrently processed by more than one task, Flink does not care about the order by default. You

Re: protobuf messages from Kafka to elasticsearch using flink

2016-03-09 Thread Fabian Hueske
Hi, I haven't used protobuf to serialize Kafka events but this blog post (+ the linked repository) shows how to write data from Flink into Elasticsearch: --> https://www.elastic.co/blog/building-real-time-dashboard-applications-with-apache-flink-elasticsearch-and-kibana Hope this helps, Fabian

Re: Flink and Directory Monitors

2016-03-09 Thread Fabian Hueske
Hi Philippe, I am not aware of anybody using Directory Monitor with Flink. However, the application you described sounds reasonable and I think it should be possible to implement that with Flink. You would need to implement a SourceFunction that forwards events from DM to Flink or you push the

Re: Submit Flink Jobs to YARN running on AWS

2016-03-09 Thread Fabian Hueske
Hi Abhi, I have used Flink on EMR via YARN a couple of times without problems. I started a Flink YARN session like this: ./bin/yarn-session.sh -n 4 -jm 1024 -tm 4096 This will start five YARN containers (1 JobManager with 1024MB, 4 Taskmanagers with 4096MB). See more config options in the

Re: Setting taskmanager.network.numberOfBuffers and getting errors...

2016-03-03 Thread Fabian Hueske
Hi Sourigna, you are using the formula correctly: #cores should to be translated into slots per taskmanager (TM), and #machines into number of TMs. So 36 ^ 2 * 10 * 4 = 51840 appears to be right. The constant 4 refers to the total number of concurrently active full network shuffles (partitioning

Re: Multi-dimensional[more than 2] input for KMeans Clustering in Apache flink

2016-03-01 Thread Fabian Hueske
Hi Subash, the KMeans implementation in Flink is meant to be a simple toy example and should not used for serious analysis tasks. It shows how the DataSet API works by implementing a well-known algorithm. Nonetheless, the example can be easily extended to work for three or more dimensions. You

Re: Iterations problem in command line

2016-03-01 Thread Fabian Hueske
till getting the same issue. > > Regards, > MArcela. > > > On 29.02.2016 16:44, Fabian Hueske wrote: > >> Hi Marcela, >> >> do you run the algorithm in both setups with the same parallelism? >> >> Best, Fabian >> >> 2016-02-26 16:52 GMT+

Re: Optimal Configuration for Cluster

2016-02-22 Thread Fabian Hueske
possible for DataSet (batch) programs. Hope this helps, Fabian 2016-02-22 11:26 GMT+01:00 Fabian Hueske <fhue...@gmail.com>: > Hi Welly, > > sorry for the late response. > > The number of network buffers primarily depends on the maximum parallelism > of your job. > The given

Re: Optimal Configuration for Cluster

2016-02-22 Thread Fabian Hueske
Hi Welly, sorry for the late response. The number of network buffers primarily depends on the maximum parallelism of your job. The given formula assumes a specific cluster configuration (1 task manager per machine, one parallel task per CPU). The formula can be translated to:

Re: Read once input data?

2016-02-16 Thread Fabian Hueske
ouble value representing > the average in the previous example. > > > On Tue, Feb 16, 2016 at 3:47 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >> You can use so-called BroadcastSets to send any sufficiently small >> DataSet (such as a computed average) to any other f

Re: Read once input data?

2016-02-16 Thread Fabian Hueske
doing multiple reads of the input data to create > the same dataset? > > Thank you, > saliya > > On Tue, Feb 16, 2016 at 12:03 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Yes, if you implement both maps in a single job, data is read once. >> >>

Re: writeAsCSV with partitionBy

2016-02-16 Thread Fabian Hueske
> > Assuming field0 is Int and has unique values 1,2,3&4. > > Srikanth > > > On Mon, Feb 15, 2016 at 6:20 AM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Hi Srikanth, >> >> DataSet.partitionBy() will partition the data on the declared partition >>

Re: Read once input data?

2016-02-16 Thread Fabian Hueske
at everything goes as > single job in Flink, so data read happens only once? > > Thanks, > Saliya > > On Mon, Feb 15, 2016 at 3:58 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >> It is not possible to "pin" data sets in memory, yet. >> Howe

Re: Read once input data?

2016-02-15 Thread Fabian Hueske
. > > Any chance you might have an example on how to define a data flow with > Flink? > > > > On Mon, Feb 15, 2016 at 3:58 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >> It is not possible to "pin" data sets in memory, yet. >> However, y

Re: Read once input data?

2016-02-15 Thread Fabian Hueske
Hi, it looks like you are executing two distinct Flink jobs. DataSet.count() triggers the execution of a new job. If you have an execute() call in your program, this will lead to two Flink jobs being executed. It is not possible to share state among these jobs. Maybe you should add a custom

Re: Problem with KeyedStream 1.0-SNAPSHOT

2016-02-15 Thread Fabian Hueske
Hi Javier, Keys is an internal class and was recently moved to a different package. So it appears like your Flink dependencies are not aligned to the same version. We also added Scala version identifiers to all our dependencies which depend on Scala 2.10. For instance, flink-scala became

Re: schedule tasks `inside` Flink

2016-02-15 Thread Fabian Hueske
Hi Michal, If I got your requirements right, you could try to solve this issue by serving the updates through a regular DataStream. You could add a SourceFunction which periodically emits a new version of the cache and a CoFlatMap operator which receives on the first input the regular streamed

Re: Regarding Concurrent Modification Exception

2016-02-15 Thread Fabian Hueske
Hi, This stacktrace looks really suspicious. It includes classes from the submission client (CLIClient), optimizer (JobGraphGenerator), and runtime (KryoSerializer). Is it possible that you try to start a new Flink job inside another job? This would not work. Best, Fabian

Re: writeAsCSV with partitionBy

2016-02-15 Thread Fabian Hueske
Hi Srikanth, DataSet.partitionBy() will partition the data on the declared partition fields. If you append a DataSink with the same parallelism as the partition operator, the data will be written out with the defined partitioning. It should be possible to achieve the behavior you described using

Re: Merge or minus Dataset API missing

2016-02-12 Thread Fabian Hueske
Hi Flavio, If I got it right, you can use a FullOuterJoin. It will give you both elements on a match and otherwise a left or a right element and null. Best, Fabian 2016-02-12 16:48 GMT+01:00 Flavio Pompermaier : > Hi to all, > > I have a use case where I have to merge 2

Re: How to convert List to flink DataSet

2016-02-10 Thread Fabian Hueske
Hi Subash, how is findOutliers implemented? It might be that you mix-up local and cluster computation. All DataSets are processed in the cluster. Please note the following: - ExecutionEnvironment.fromCollection() transforms a client local connection into a DataSet by serializing it and sending

Re: Dataset filter improvement

2016-02-10 Thread Fabian Hueske
Hi Flavio, I did not completely understand which objects should go where, but here are some general guidelines: - early filtering is mostly a good idea (unless evaluating the filter expression is very expensive) - you can use a flatMap function to combine a map and a filter - applying multiple

Re: Dataset filter improvement

2016-02-10 Thread Fabian Hueske
e") in order to generate the typed > dataset based. > Which one do you think is the best solution? > > On Wed, Feb 10, 2016 at 9:40 AM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Hi Flavio, >> >> I did not completely understand which objects should go

Re: How to convert List to flink DataSet

2016-02-10 Thread Fabian Hueske
ntWithDistance.f1.f0, elementWithDistance.f1.f1, > true); > } else { > newElement.setFields(elementWithDistance.f1.f0, elementWithDistance.f1.f1, > false); > } > finalElements.add(newElement); > } > } > return finalElements; > } > > I have attached here

Re: TaskManager unable to register with JobManager

2016-02-10 Thread Fabian Hueske
Hi Ravinder, please have a look at the configuration documentation: --> https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#jobmanager-amp-taskmanager Best, Fabian 2016-02-10 13:55 GMT+01:00 Ravinder Kaur : > Hello All, > > I need to know the range of

Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Fabian Hueske
Hi, glad you could resolve the POJO issue, but the new error doesn't look right. The CO_GROUP_RAW strategy should only be used for programs that are implemented against the Python DataSet API. I guess that's not the case since all code snippets were Java so far. Can you post the full stacktrace

Re: Kafka partition alignment for event time

2016-02-09 Thread Fabian Hueske
Hi, where did you observe the duplicates, within Flink or in Kafka? Please be aware that the Flink Kafka Producer does not provide exactly-once consistency. This is not easily possible because Kafka does not support transactional writes yet. Flink's exactly-once guarantees are only valid within

Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Fabian Hueske
What is the type of sessionId? It must be a key type in order to be used as key. If it is a generic class, it must implement Comparable to be used as key. 2016-02-09 11:53 GMT+01:00 Dominique Rondé : > The fields in SourceA and SourceB are private but have public

Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Fabian Hueske
String is perfectly fine as key. Looks like SourceA / SourceB are not correctly identified as Pojos. 2016-02-09 14:25 GMT+01:00 Dominique Rondé : > Sorry, i was out for lunch. Maybe the problem is that sessionID is a > String? > > public abstract class Parent{ >

Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Fabian Hueske
.runtime.taskmanager.Task.run(Task.java:584) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.Exception: Unsupported driver strategy for join > driver: CO_GROUP_RAW > at > org.apache.flink.runtime.operators.JoinDriver.prepare(JoinDriver.java:193) > at org.apache.flink.runtime.operators.BatchTa

Re: Error while reading binary file

2016-02-08 Thread Fabian Hueske
Hi, please try to replace DataSet ds = env.createInput(sif); by DataSet ds = env.createInput(sif, ValueTypeInfo.SHORT_VALUE_TYPE_INFO); Best, Fabian 2016-02-08 19:33 GMT+01:00 Saliya Ekanayake : > Till, > > I am still having trouble getting this to work. Here's my code ( >

Re: Error while reading binary file

2016-02-08 Thread Fabian Hueske
pache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:169) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584) > at java.lang.Thread.run(Thread.java:745) > > On Mon, Feb 8, 2016 at 3:50 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Hi, >> >> please try to replace >&

Re: Flink writeAsCsv

2016-02-04 Thread Fabian Hueske
You can get the end time of a window from the TimeWindow object which is passed to the AllWindowFunction. This is basically a window ID / index. I would go for a custom output sink which writes records to files based on their timestamp. IMO, this would be cleaner & easier than implementing the

Re: Build Flink for a specific tag

2016-02-03 Thread Fabian Hueske
Hi Flavio, we use tags to identify releases. The "release-0.10.1" tag, refers to the code that has been released as Flink 0.10.1. The "release-0.10" branch is used to develop 0.10 releases. Currently, it contains Flink 0.10.1 and additionally a few more bug fix commits. We will fork off this

Re: Left join with unbalanced dataset

2016-02-03 Thread Fabian Hueske
Hi Arnauld, in a previous mail you said: "Note that I did not rebuild & reinstall flink, I just used a 0.10-SNAPSHOT compiled jar submitted as a batch job using the "0.10.0" flink installation" This will not fix the Netty version error. You need to install a new Flink version or submit the Flink

Re: Where does Flink store intermediate state and triggering/scheduling a delayed event?

2016-02-03 Thread Fabian Hueske
Hi, 1) At the moment, state is kept on the JVM heap in a regular HashMap. However, we added an interface for pluggable state backends. State backends store the operator state (Flink's built-in window operators are based on operator state as well). A pull request to add a RocksDB backend (going

Re: Left join with unbalanced dataset

2016-02-03 Thread Fabian Hueske
as hoping it would suffice… > > However, I’ve just recompiled everything and ran with a real 0.10.1 > snapshot and everything worked at an astounding speed with a reasonable > memory amount. > > Thanks for the great work and the help, as always, > > Arnaud > > &g

Re: Possibility to get the line numbers?

2016-02-03 Thread Fabian Hueske
Hi Anastasiia, this is difficult because the input is usually read in parallel, i.e., an input file is split into several blogs which are independently read and processed by different threads (possibly on different machines). So it is difficult to have a sequential row number. If all rows have

Re: Hello, a question about Dashborad in Flink

2016-01-29 Thread Fabian Hueske
or > Dashboard? > > I did not find the network usage metric in it. > > Best, > Phil > > On Mon, Jan 25, 2016 at 5:06 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >> You can start a job and then periodically request and store information >> about the run

Re: Writing Parquet files with Flink

2016-01-29 Thread Fabian Hueske
; each HDFS file), right? > > > On Fri, Jan 29, 2016 at 11:56 AM, Fabian Hueske <fhue...@gmail.com> wrote: > >> The number of input splits does not depend on the number of files but on >> the number of HDFS blocks of all files. >> Reading a single file with 100 HDFS

Re: Writing Parquet files with Flink

2016-01-29 Thread Fabian Hueske
d then > generate the InputSplits. Am I right? Or am I misunderstanding something? > > Best, > Flavio > > On Fri, Jan 29, 2016 at 11:14 AM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Hi Flavio, >> >> using a default FileOutputFormat, Flink writes one

Re: Writing Parquet files with Flink

2016-01-29 Thread Fabian Hueske
Hi Flavio, using a default FileOutputFormat, Flink writes one output file for each data sink task, i.e., as many files as the defined parallelism. The size of these files depends on the total output size and the distribution. If you write to HDFS, a file consists of one or more HDFS blocks.

Re: Flink stream data ordering/sequence

2016-01-29 Thread Fabian Hueske
Hi Sana, The feature you are looking for is called event time processing in Flink. These blog posts should help you to become familiar with the concepts: 1) Event-Time concepts: http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/ 2) Windows in Flink:

Re: Window stream using timestamp key for time

2016-01-28 Thread Fabian Hueske
Hi Emmanuel, the feature you are looking for is called event time processing in Flink. These blog posts should help you to become familiar with the concepts: 1) Event-Time concepts: http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/ 2) Windows in Flink:

Re: Mixing Batch & Streaming

2016-01-28 Thread Fabian Hueske
Hi, this is currently not support yet. However, this feature is on our roadmap and has been requested for a few times. So I hope somebody will pick it up soon. If the static data set is small enough, you can read the full data set (e.g., as a file) in the open method of FlatMapFunction, build a

Re: Task Manager metrics per job on Flink 0.9.1

2016-01-27 Thread Fabian Hueske
Hi, it is correct that the metrics are collected from the task managers. In Flink 0.9.1 the metrics are visualized as charts in the web dashboard. This visualization was removed when the dashboard was redesigned and updated for 0.10. but will be hopefully be added again. For Flink 0.9.1, the

Re: Reading Binary Data (Matrix) with Flink

2016-01-25 Thread Fabian Hueske
d reads. Also, > the file is replicated across nodes and the reading (mapping) happens only > once. > > Thank you, > Saliya > > On Mon, Jan 25, 2016 at 4:38 AM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Hi Saliya, >> >> yes that is possible, how

Re: Hello, a question about Dashborad in Flink

2016-01-25 Thread Fabian Hueske
You can start a job and then periodically request and store information about the running job and vertices from using corresponding REST calls [1]. The data will be in JSON format. After the job finished, you can stop requesting data. Next you parse the JSON, extract the information you need and

Re: Reading Binary Data (Matrix) with Flink

2016-01-25 Thread Fabian Hueske
Hi Saliya, yes that is possible, however the requirements for reading a binary file from local fs are basically the same as for reading it from HDSF. In order to be able to start reading different sections of a file in parallel, you need to know the different starting positions. This can be done

Re: Actual byte-streams in multiple-node pipelines

2016-01-21 Thread Fabian Hueske
Hi Tal, you said that most processing will be done in external processes. If these processes are stateful, this might be hard to integrate with Flink's fault-tolerance mechanism. In principle, Flink requires two things to achieve exactly-once processing: 1) A data source that can be replayed from

Re: Flink Execution Plan

2016-01-14 Thread Fabian Hueske
@Christian: I don't think that is possible. There are quite a few things missing in the JSON including: - User function objects (Flink ships objects not class names) - Function configuration objects - Data types Best, Fabian 2016-01-14 16:02 GMT+01:00 lofifnc : > Hi

<    9   10   11   12   13   14   15   16   >