Hi Flavio,
I thought a bit about your proposal. I am not sure if it is actually
necessary to integrate a central source repository into Flink. It should be
possible to offer this as an external service which is based on the
recently added TableSource interface. TableSources could be extended to
orrect and
> custom runtime operators are not supposed to be implemented outside of
> Flink.
>
> Thanks,
>
> Simone
>
> 2016-04-29 21:32 GMT+02:00 Fabian Hueske <fhue...@gmail.com>:
>
>> Hi Simone,
>>
>> the GraphCreatingVisitor transforms the com
Yes, but be aware that your program runs with parallelism 1 if you do not
configure the parallelism.
2016-05-03 11:07 GMT+02:00 Punit Naik :
> Hi Stephen, Fabian
>
> setting "fs.output.always-create-directory" to true in flink-conf.yaml
> worked!
>
> On Tue, May 3, 2016
Did you specify a parallelism? The default parallelism of a Flink instance
is 1 [1].
You can set a different default parallelism in ./conf/flink-conf.yaml or
pass a job specific parallelism with ./bin/flink using the -p flag [2].
More options to define parallelism are in the docs [3].
[1]
Hi John,
S3 keys are configured via Hadoop's configuration files.
Check out the documentation for AWS setups [1].
Cheers, Fabian
[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/aws.html
2016-05-02 20:22 GMT+02:00 John Sherwood :
> Hello all,
>
> I'm
Grouping a grouped dataset is not supported.
You can group on multiple keys: dataSet.groupBy(1,2).
Can you describe your use case if that does not solve the problem?
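For example, a minimal sketch (types and values invented for illustration):

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple3<Integer, String, String>> input = env.fromElements(
    new Tuple3<>(1, "a", "x"), new Tuple3<>(2, "a", "x"), new Tuple3<>(3, "b", "y"));
// composite key: group on fields 1 and 2 in a single groupBy, then aggregate field 0
input.groupBy(1, 2).sum(0).print();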
2016-05-02 10:34 GMT+02:00 Punit Naik :
> Hello
>
> I wanted to perform a groupBy on an already grouped
Yes, it looks like the connector only creates the connection once when it
starts and fails if the host is no longer reachable.
It should be possible to catch that failure and try to re-open the
connection.
I opened a JIRA for this issue (FLINK-3857).
Would you like to implement the improvement?
Have you checked the log files as well?
2016-05-01 14:07 GMT+02:00 subash basnet :
> Hello there,
>
> If anyone could help me know why the below *result* DataStream gets
> written as text, but not as CSV? As it's in a tuple format, I guess it
> should be the same for both
Hi Nirmalya,
the solution with List.size() won't use a combiner and won't be efficient
for large data sets with large groups.
I would recommend adding a 1 and using GroupedDataSet.sum().
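In code, the idea looks roughly like this (a sketch with invented types; in
the Scala API the final call would be GroupedDataSet.sum):

// Attach a 1 to every element and sum it. sum() is combinable, so counts are
// pre-aggregated locally before the shuffle, unlike collecting a List.
DataSet<Tuple2<String, Integer>> counts = words
    .map(new MapFunction<String, Tuple2<String, Integer>>() {
        @Override
        public Tuple2<String, Integer> map(String w) {
            return new Tuple2<>(w, 1);
        }
    })
    .groupBy(0)
    .sum(1);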
2016-05-01 12:48 GMT+02:00 nsengupta :
> Hello all,
>
> This is how I have moved
The slot configuration should depend on the complexity of jobs.
Since each slot runs a "slice" of a program, one slot might potentially
execute many concurrent tasks.
For complex jobs you should allocate more than one core for each slot.
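For reference, the number of slots per TaskManager is set in flink-conf.yaml
(the value below is only an example):

taskmanager.numberOfTaskSlots: 4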
2016-05-02 10:12 GMT+02:00 Robert Metzger
I was going to increase the Job Manager heap to 3 GB and maybe change some
> gc setting.
> Do you think I should increase also the akka timeout or other things?
>
> On Thu, Apr 28, 2016 at 2:06 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hmm, 113k splits
This is 21),(23,This is 23),(25,This is 25),(27,This is
> 27),(29,This is 29)
> ! 36: (20,This is 20),(22,This is 22),(24,This is 24),(26,This is
> 26),(28,This is 28)
>
>
> And if you can give a bit more info on why will I have latency issues in a
> case of varying rate of arri
sted the wrong TM heap size...that is indeed 3Gb (
> taskmanager.heap.mb:512)
>
> Best,
> Flavio
>
> On Thu, Apr 28, 2016 at 1:29 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Is the problem reproducible?
>> Maybe the SplitAssigner gets stuck somehow, but I've n
> Do you see anything suspicious?
>
> Thanks for the support,
> Flavio
>
> On Thu, Apr 28, 2016 at 11:54 AM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> I checked the input format from your PR, but didn't see anything
>> suspicious.
>>
>> It is def
Hi Konstantin,
if you do not need a deterministic grouping of elements you should not use
a keyed stream or window.
Instead you can do the lookups in a parallel flatMap function. The function
would collect arriving elements and perform a lookup query after a certain
number of elements have arrived
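A rough sketch of that pattern (the batch size and lookupBatch() are invented
placeholders for the actual service):

public class BatchingLookup extends RichFlatMapFunction<String, String> {
    private static final int BATCH_SIZE = 100;   // assumed threshold
    private transient List<String> buffer;

    @Override
    public void open(Configuration parameters) {
        buffer = new ArrayList<String>();
    }

    @Override
    public void flatMap(String value, Collector<String> out) {
        buffer.add(value);
        if (buffer.size() >= BATCH_SIZE) {
            // one query for the whole batch instead of one per element
            for (String result : lookupBatch(buffer)) {
                out.collect(result);
            }
            buffer.clear();
        }
    }

    // placeholder for the external lookup service
    private List<String> lookupBatch(List<String> keys) {
        return keys;
    }
}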
Hi Robert,
Function configuration via a Configuration object and the open method is an
artifact from the past.
The recommended way is to configure the function object via the
constructor.
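For example (an invented filter function configured through its constructor):

public class ThresholdFilter implements FilterFunction<Integer> {
    private final int threshold;

    public ThresholdFilter(int threshold) {
        this.threshold = threshold;   // set once; shipped with the serialized object
    }

    @Override
    public boolean filter(Integer value) {
        return value > threshold;
    }
}

// usage: ds.filter(new ThresholdFilter(42));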
Flink serializes the function object and ships it to the workers for
execution. So the state of a function
I checked the input format from your PR, but didn't see anything
suspicious.
It is definitely OK if the processing of an input split takes more than 10
seconds. That should not be the cause.
It rather looks like the DataSourceTask fails to request a new split from
the JobManager.
2016-04-28 9:37
Hi,
Table API and SQL for streaming are work in progress. A first version which
supports projection, filter, and union is merged to the master branch.
Under the hood, Flink uses Calcite to optimize and translate Table API and
SQL queries.
Best, Fabian
2016-04-27 14:27 GMT+02:00 Zhangrucong
Hi Timur,
I had a look at the plan you shared.
I could not find any flow that branches and merges again, a pattern which
is prone to cause deadlocks.
However, I noticed that the plan performs a lot of partitioning steps.
You might want to have a look at forwarded field annotations which can
Hi Ken,
at the moment, there are just two parameters to control the parallelism of
Flink operators generated by the Cascading-Flink connector.
The parameters are:
- flink.num.sourceTasks to specify the parallelism of source tasks.
- flink.num.shuffleTasks to specify the parallelism of all
Hi Biplob,
Flink is a distributed, data parallel system which means that there are
several instances of your ReduceFunction running in parallel, each with its
own timestamp counter.
If you want to have a unique timestamp, you have to set the parallelism of
the reduce operator to 1, but then the
Actually, memory should not be a problem since the full data set would not
be materialized in memory.
Flink has a streaming runtime so most of the data would be immediately
filtered out.
However, reading the whole file of course causes a lot of unnecessary I/O.
2016-04-26 17:09 GMT+02:00 Biplob
Hi Timur,
a TaskManager may run as many subtasks of a Map operator as it has slots.
Each subtask of an operator runs in a different thread. Each parallel
subtask of a Map operator has its own MapFunction object, so it should be
possible to use a lazy val.
However, you should not use static
Hi Biplob,
you can also implement a generic InputFormat (IF) that wraps another IF (such
as a CsvInputFormat).
The wrapping IF forwards all calls to the wrapped IF and in addition counts
how many records were emitted (how often InputFormat.nextRecord() was
called).
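A sketch of such a wrapper, assuming the goal is to stop emitting once a limit
is reached (note that the count is per parallel instance, not global):

public class LimitingFormat<OT, S extends InputSplit> implements InputFormat<OT, S> {
    private final InputFormat<OT, S> wrapped;
    private final long limit;
    private long count;

    public LimitingFormat(InputFormat<OT, S> wrapped, long limit) {
        this.wrapped = wrapped;
        this.limit = limit;
    }

    public void configure(Configuration parameters) { wrapped.configure(parameters); }

    public BaseStatistics getStatistics(BaseStatistics cached) throws IOException {
        return wrapped.getStatistics(cached);
    }

    public S[] createInputSplits(int minNumSplits) throws IOException {
        return wrapped.createInputSplits(minNumSplits);
    }

    public InputSplitAssigner getInputSplitAssigner(S[] splits) {
        return wrapped.getInputSplitAssigner(splits);
    }

    public void open(S split) throws IOException {
        count = 0;
        wrapped.open(split);
    }

    public boolean reachedEnd() throws IOException {
        // report end-of-input as soon as the threshold is hit
        return count >= limit || wrapped.reachedEnd();
    }

    public OT nextRecord(OT reuse) throws IOException {
        count++;
        return wrapped.nextRecord(reuse);
    }

    public void close() throws IOException { wrapped.close(); }
}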
Once the count arrives at the threshold, it
Hi Konstantin,
this exception is thrown if you do not set the time characteristic to event
time and assign timestamps.
Please try to add
> env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
after you obtained the StreamExecutionEnvironment.
Best, Fabian
2016-04-22 15:47 GMT+02:00
to make sure that your operator can recover from failures.
Cheers, Fabian
2016-04-21 23:16 GMT+02:00 Jonathan Yom-Tov <jon.yom...@gmail.com>:
> Thanks. Any pointers on how to do that? Or code examples which do similar
> things?
>
> On Thu, Apr 21, 2016 at 10:30 PM, F
Yes, sliding windows are different.
You want to evaluate the window whenever a new element arrives or an
element leaves because 5 secs passed since it entered the window, right?
I think that should be possible with a GlobalWindow, a custom Trigger which
holds state about the time when each
Hi Simone,
in Flink 1.0.x, the Table API does not support reading external data, i.e.,
it is not possible to read a CSV file directly from the Table API.
Tables can only be created from DataSet or DataStream which means that the
data is already converted into "Flink types".
However, the Table
Hi Yifei,
I think this has not been done before. At least I am not aware of anybody
running Flink in cluster mode on Windows.
In principle this should work. It is possible to start a local instance on
Windows (start-local.bat) and to locally execute Flink programs on this
instance using the
nbounded all elements that arrive in the first window
> are grouped/partitioned by keys and aggregated and so on until no more
> streams left. The global result then has the aggregated key/value pairs.
>
> Kind Regards,
> Ravinder Kaur
>
>
>
> On Wed, Apr 20, 2016 at 12:12 PM,
Hi Piyush,
if you explicitly set a trigger, the default trigger of the window is
replaced.
In your example, the time trigger is replaced by the count trigger, i.e.,
the window is only evaluated after the 100th element was received.
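In code, the replacement looks like this (a sketch; the stream and window size
are invented):

// CountTrigger replaces the time window's default trigger, so the window
// now fires on the 100th element instead of at the time boundary.
keyedStream
    .timeWindow(Time.minutes(1))
    .trigger(CountTrigger.of(100))
    .sum(1);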
This blog post discusses windows and triggers [1].
Best, Fabian
Hi Ravinder,
your drawing is pretty much correct (Flink will inject a combiner between
flat map and reduce which locally combines records with the same key).
The partitioning between flat map and reduce is done with hash partitioning
by default. However, you can also define a custom partitioner
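A custom partitioner would look roughly like this (key type and routing rule
invented):

DataSet<Tuple2<Integer, String>> partitioned = input.partitionCustom(
    new Partitioner<Integer>() {
        @Override
        public int partition(Integer key, int numPartitions) {
            return key % numPartitions;   // any deterministic routing rule
        }
    }, 0);   // field 0 is the partition key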
Hi Ovidiu,
Hash tables are currently used for joins (inner & outer) and the solution
set of delta iterations.
There is a pending PR that implements a hash table for partial aggregations
(combiner) [1] which should be added soon.
Joins (inner & outer) are already implemented as Hybrid Hash joins
>>>>> Regards,
>>>>> Robert
>>>>>
>>>>> [1]
>>>>> https://cwiki.apache.org/confluence/display/FLINK/Powered+by+Flink
>>>>>
>>>>>
>>>>> On Mon, Oct 19, 2015 at 4:10 PM, Matthias J. Sax <mj...@apache.org
>>>>>
I would go with an outer join as Stefano suggested.
Outer joins can be executed as hash joins which will probably be more
efficient than using a sort based groupBy/reduceGroup.
Also, outer joins are more intuitive and simpler, IMO.
2016-04-07 12:35 GMT+02:00 Stefano Baghino
Hopping windows is a term used on the Apache Calcite website [1]. In Flink
terms, hopping windows are sliding windows.
Cheers, Fabian
[1] http://calcite.apache.org/docs/stream.html
From: Ufuk Celebi
Sent: Monday, 28 March 2016, 12:40
To: user@flink.apache.org
Subject: Re: Window Support in
r...@teamaol.com> wrote:
>
>> Fabian,
>>
>> I'll try extending InputFormat as you suggested and will create a JIRA
>> issue as well.
>>
>> I also have an AvroGenericRecordInput format class that I would like to
>> contribute once I have time to clean i
Hi,
no, this is currently not supported. However, I agree this would be a very
valuable addition to the FileInputFormat.
Would you mind opening a JIRA issue with your suggestions?
Until this is added to Flink, it can be implemented as a custom InputFormat
based on FileInputFormat by overriding
Hi,
right now there is no way to sequentially execute the input tasks. Flink's
FileInputFormat does also not support multiple paths out of the box.
However, it is certainly possible to extend the FileInputFormat such that
this is possible.
You would need to override / extend the
Hi,
did you find the documentation for configuring the parallelism [1]?
It explains how to set the parallelism on different levels: Cluster, Job,
Task.
Best, Fabian
[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/common/index.html#parallel-execution
2016-03-18 13:34 GMT+01:00
eInBytes), along with OS cache.
>
> I would like parameters like:
> taskmanager.off-heap.size or taskmanager.off-heap.fraction
> taskmanager.off-heap.enabled true or false
> and same for heap.
>
> Thanks for clarification.
>
> Best,
> Ovidiu
>
>
> On 16 Mar 2016, at 13:43,
Hi Bart,
if you run a fold function on a keyed stream without a window, there is no
way to remove the key and the folded value.
You will eventually run out of memory if your key space is continuously
growing.
If you apply a fold function in a window on a keyed stream you can bound
the "lifetime"
se-1.0/setup/config.html#managed-memory
>
> Best,
> Ovidiu
>
> On 16 Mar 2016, at 12:13, Fabian Hueske <fhue...@gmail.com> wrote:
>
> Hi Ovidiu,
>
> the parameters to configure the amount of managed memory
> (taskmanager.memory.size,
> taskmanager.memory.
-
> Bart van Deenen
> bartvandee...@fastmail.fm
>
>
>
> On Fri, Mar 18, 2016, at 11:54, Fabian Hueske wrote:
>
> Hi Bart,
> if you run a fold function on a keyed stream without a window, there is no
> way to remove the key and the folded value.
> You will eve
Hi Ken,
you can open an issue on the Github repository or send a mail to me.
Thanks,
Fabian
2016-03-17 23:07 GMT+01:00 Ken Krugler :
> Hi list,
>
> What's the right way to provide input on the training exercises
>
Hi Ovidiu,
the parameters to configure the amount of managed memory
(taskmanager.memory.size,
taskmanager.memory.fraction) are valid for both on-heap and off-heap memory.
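For example, in flink-conf.yaml (values only illustrative):

taskmanager.memory.fraction: 0.7   # relative share of the TM memory
# or an absolute amount in MB, which overrides the fraction:
# taskmanager.memory.size: 2048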
Have you tried these parameters, and did they not work as expected?
Best, Fabian
2016-03-16 11:43 GMT+01:00 Ovidiu-Cristian MARCU <
Hi Mengqi,
I did not completely understand your use case.
If you would like to use a composite key (a key with multiple fields) there
are two alternatives:
- use a tuple as key type. This only works if all records have the same
number of key fields. Tuple serialization and comparisons are very
Hi Ovidiu,
putting the CompactingHashTable aside, all data structures and algorithms
that use managed memory can spill to disk if data exceeds memory capacity.
It was a conscious choice to not let the CompactingHashTable spill. Once
the solution set hash table is spilled, (parts of) the hash
Hi Zach,
at the moment, accumulators are not checkpointed and are reset if a failed
task is restarted.
Best, Fabian
2016-03-15 17:27 GMT+01:00 Zach Cox :
> Are accumulators stored in checkpoint state? If a job fails and restarts,
> are all accumulator values lost, or are they
Hi Benjamin,
Flink usually reads data in parallel. This is done by splitting the input
(e.g., a file) into several input splits. Each input split is independently
processed. Since splits are usually concurrently processed by more than one
task, Flink does not care about the order by default.
You
Hi,
I haven't used protobuf to serialize Kafka events but this blog post (+ the
linked repository) shows how to write data from Flink into Elasticsearch:
-->
https://www.elastic.co/blog/building-real-time-dashboard-applications-with-apache-flink-elasticsearch-and-kibana
Hope this helps,
Fabian
Hi Philippe,
I am not aware of anybody using Directory Monitor with Flink. However, the
application you described sounds reasonable and I think it should be
possible to implement that with Flink.
You would need to implement a SourceFunction that forwards events from DM
to Flink or you push the
Hi Abhi,
I have used Flink on EMR via YARN a couple of times without problems.
I started a Flink YARN session like this:
./bin/yarn-session.sh -n 4 -jm 1024 -tm 4096
This will start five YARN containers (1 JobManager with 1024MB, 4
Taskmanagers with 4096MB). See more config options in the
Hi Sourigna,
you are using the formula correctly: #cores should be translated into
slots per taskmanager (TM), and #machines into number of TMs. So 36 ^ 2 *
10 * 4 = 51840 appears to be right.
The constant 4 refers to the total number of concurrently active full
network shuffles (partitioning
Hi Subash,
the KMeans implementation in Flink is meant to be a simple toy example and
should not be used for serious analysis tasks.
It shows how the DataSet API works by implementing a well-known algorithm.
Nonetheless, the example can be easily extended to work for three or more
dimensions.
You
till getting the same issue.
>
> Regards,
> MArcela.
>
>
> On 29.02.2016 16:44, Fabian Hueske wrote:
>
>> Hi Marcela,
>>
>> do you run the algorithm in both setups with the same parallelism?
>>
>> Best, Fabian
>>
>> 2016-02-26 16:52 GMT+
possible for DataSet (batch) programs.
Hope this helps, Fabian
2016-02-22 11:26 GMT+01:00 Fabian Hueske <fhue...@gmail.com>:
> Hi Welly,
>
> sorry for the late response.
>
> The number of network buffers primarily depends on the maximum parallelism
> of your job.
> The given
Hi Welly,
sorry for the late response.
The number of network buffers primarily depends on the maximum parallelism
of your job.
The given formula assumes a specific cluster configuration (1 task manager
per machine, one parallel task per CPU).
The formula can be translated to: #slots-per-TM^2 * #TMs * 4.
ouble value representing
> the average in the previous example.
>
>
> On Tue, Feb 16, 2016 at 3:47 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> You can use so-called BroadcastSets to send any sufficiently small
>> DataSet (such as a computed average) to any other f
doing multiple reads of the input data to create
> the same dataset?
>
> Thank you,
> saliya
>
> On Tue, Feb 16, 2016 at 12:03 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Yes, if you implement both maps in a single job, data is read once.
>>
>>
>
> Assuming field0 is Int and has unique values 1,2,3&4.
>
> Srikanth
>
>
> On Mon, Feb 15, 2016 at 6:20 AM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Srikanth,
>>
>> DataSet.partitionBy() will partition the data on the declared partition
>>
at everything goes as
> single job in Flink, so data read happens only once?
>
> Thanks,
> Saliya
>
> On Mon, Feb 15, 2016 at 3:58 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> It is not possible to "pin" data sets in memory, yet.
>> Howe
.
>
> Any chance you might have an example on how to define a data flow with
> Flink?
>
>
>
> On Mon, Feb 15, 2016 at 3:58 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> It is not possible to "pin" data sets in memory, yet.
>> However, y
Hi,
it looks like you are executing two distinct Flink jobs.
DataSet.count() triggers the execution of a new job. If you have an
execute() call in your program, this will lead to two Flink jobs being
executed.
It is not possible to share state among these jobs.
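For illustration, this (sketched) program executes two separate jobs:

DataSet<String> data = env.readTextFile("hdfs:///input");   // example path
long n = data.count();                 // count() eagerly triggers job #1
data.writeAsText("hdfs:///output");    // example path
env.execute();                         // triggers job #2, re-reading the input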
Maybe you should add a custom
Hi Javier,
Keys is an internal class and was recently moved to a different package.
So it appears like your Flink dependencies are not aligned to the same
version.
We also added Scala version identifiers to all our dependencies which
depend on Scala 2.10.
For instance, flink-scala became
Hi Michal,
If I got your requirements right, you could try to solve this issue by
serving the updates through a regular DataStream.
You could add a SourceFunction which periodically emits a new version of
the cache and a CoFlatMap operator which receives on the first input the
regular streamed
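A rough sketch of that two-input pattern (types and the cache layout are
invented):

// flatMap1 receives the regular events, flatMap2 the periodic cache versions.
public class CacheEnricher
        implements CoFlatMapFunction<String, Map<String, String>, String> {

    private Map<String, String> cache = new HashMap<String, String>();

    @Override
    public void flatMap1(String event, Collector<String> out) {
        out.collect(event + "," + cache.get(event));   // enrich from the cache
    }

    @Override
    public void flatMap2(Map<String, String> newCache, Collector<String> out) {
        cache = newCache;   // swap in the freshly emitted cache version
    }
}

// usage: events.connect(cacheUpdates).flatMap(new CacheEnricher());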
Hi,
This stacktrace looks really suspicious.
It includes classes from the submission client (CLIClient), optimizer
(JobGraphGenerator), and runtime (KryoSerializer).
Is it possible that you try to start a new Flink job inside another job?
This would not work.
Best, Fabian
Hi Srikanth,
DataSet.partitionBy() will partition the data on the declared partition
fields.
If you append a DataSink with the same parallelism as the partition
operator, the data will be written out with the defined partitioning.
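Sketched (partition field and path invented):

input.partitionByHash(0)            // or partitionByRange / partitionCustom
     .writeAsCsv("hdfs:///out");    // sink inherits the parallelism, so the
                                    // on-disk layout follows the partitioning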
It should be possible to achieve the behavior you described using
Hi Flavio,
If I got it right, you can use a FullOuterJoin.
It will give you both elements on a match; otherwise, it gives the left or
right element paired with null.
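Sketched with the Java DataSet API (types invented):

// For an outer join, the join function must expect null on either side.
DataSet<Tuple2<Integer, String>> merged = left
    .fullOuterJoin(right)
    .where(0)
    .equalTo(0)
    .with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>,
                           Tuple2<Integer, String>>() {
        @Override
        public Tuple2<Integer, String> join(Tuple2<Integer, String> l,
                                            Tuple2<Integer, String> r) {
            return l != null ? l : r;   // keep whichever side is present
        }
    });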
Best, Fabian
2016-02-12 16:48 GMT+01:00 Flavio Pompermaier :
> Hi to all,
>
> I have a use case where I have to merge 2
Hi Subash,
how is findOutliers implemented?
It might be that you mix up local and cluster computation. All DataSets are
processed in the cluster. Please note the following:
- ExecutionEnvironment.fromCollection() transforms a client local
collection into a DataSet by serializing it and sending
Hi Flavio,
I did not completely understand which objects should go where, but here are
some general guidelines:
- early filtering is mostly a good idea (unless evaluating the filter
expression is very expensive)
- you can use a flatMap function to combine a map and a filter
- applying multiple
e") in order to generate the typed
> dataset based.
> Which one do you think is the best solution?
>
> On Wed, Feb 10, 2016 at 9:40 AM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Flavio,
>>
>> I did not completely understand which objects should go
ntWithDistance.f1.f0, elementWithDistance.f1.f1,
> true);
> } else {
> newElement.setFields(elementWithDistance.f1.f0, elementWithDistance.f1.f1,
> false);
> }
> finalElements.add(newElement);
> }
> }
> return finalElements;
> }
>
> I have attached here
Hi Ravinder,
please have a look at the configuration documentation:
-->
https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#jobmanager-amp-taskmanager
Best, Fabian
2016-02-10 13:55 GMT+01:00 Ravinder Kaur :
> Hello All,
>
> I need to know the range of
Hi,
glad you could resolve the POJO issue, but the new error doesn't look
right.
The CO_GROUP_RAW strategy should only be used for programs that are
implemented against the Python DataSet API.
I guess that's not the case since all code snippets were Java so far.
Can you post the full stacktrace
Hi,
where did you observe the duplicates, within Flink or in Kafka?
Please be aware that the Flink Kafka Producer does not provide exactly-once
consistency. This is not easily possible because Kafka does not support
transactional writes yet.
Flink's exactly-once guarantees are only valid within
What is the type of sessionId?
It must be a key type in order to be used as a key. If it is a generic class,
it must implement Comparable.
2016-02-09 11:53 GMT+01:00 Dominique Rondé :
> The fields in SourceA and SourceB are private but have public
String is perfectly fine as key.
Looks like SourceA / SourceB are not correctly identified as Pojos.
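For reference, a minimal class that Flink accepts as a POJO looks like this
(field names invented):

// Requirements: public standalone class, public no-argument constructor,
// and every field either public or reachable via getters/setters.
public class Event {
    public String sessionId;   // a String field is perfectly fine as a key
    public long timestamp;

    public Event() {}          // the required no-arg constructor
}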
2016-02-09 14:25 GMT+01:00 Dominique Rondé :
> Sorry, i was out for lunch. Maybe the problem is that sessionID is a
> String?
>
> public abstract class Parent{
>
.runtime.taskmanager.Task.run(Task.java:584)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.Exception: Unsupported driver strategy for join
> driver: CO_GROUP_RAW
> at
> org.apache.flink.runtime.operators.JoinDriver.prepare(JoinDriver.java:193)
> at org.apache.flink.runtime.operators.BatchTa
Hi,
please try to replace
DataSet<ShortValue> ds = env.createInput(sif);
by
DataSet<ShortValue> ds = env.createInput(sif,
ValueTypeInfo.SHORT_VALUE_TYPE_INFO);
Best, Fabian
2016-02-08 19:33 GMT+01:00 Saliya Ekanayake :
> Till,
>
> I am still having trouble getting this to work. Here's my code (
>
pache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:169)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
> at java.lang.Thread.run(Thread.java:745)
>
> On Mon, Feb 8, 2016 at 3:50 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi,
>>
>> please try to replace
>
You can get the end time of a window from the TimeWindow object which is
passed to the AllWindowFunction. This is basically a window ID / index.
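A sketch of reading the window end time (types invented; the exact
AllWindowFunction signature differs slightly across Flink versions):

stream.timeWindowAll(Time.minutes(1))
    .apply(new AllWindowFunction<Long, Tuple2<Long, Long>, TimeWindow>() {
        @Override
        public void apply(TimeWindow window, Iterable<Long> values,
                          Collector<Tuple2<Long, Long>> out) {
            long sum = 0;
            for (long v : values) {
                sum += v;
            }
            // window.getEnd() serves as the window's ID / index
            out.collect(new Tuple2<Long, Long>(window.getEnd(), sum));
        }
    });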
I would go for a custom output sink which writes records to files based on
their timestamp.
IMO, this would be cleaner & easier than implementing the
Hi Flavio,
we use tags to identify releases. The "release-0.10.1" tag, refers to the
code that has been released as Flink 0.10.1.
The "release-0.10" branch is used to develop 0.10 releases. Currently, it
contains Flink 0.10.1 and additionally a few more bug fix commits. We will
fork off this
Hi Arnaud,
in a previous mail you said:
"Note that I did not rebuild & reinstall flink, I just used a 0.10-SNAPSHOT
compiled jar submitted as a batch job using the "0.10.0" flink installation"
This will not fix the Netty version error. You need to install a new Flink
version or submit the Flink
Hi,
1) At the moment, state is kept on the JVM heap in a regular HashMap.
However, we added an interface for pluggable state backends. State backends
store the operator state (Flink's built-in window operators are based on
operator state as well). A pull request to add a RocksDB backend (going
as hoping it would suffice…
>
> However, I’ve just recompiled everything and ran with a real 0.10.1
> snapshot and everything worked at an astounding speed with a reasonable
> memory amount.
>
> Thanks for the great work and the help, as always,
>
> Arnaud
>
>
>
Hi Anastasiia,
this is difficult because the input is usually read in parallel, i.e., an
input file is split into several blocks which are independently read and
processed by different threads (possibly on different machines). So it is
difficult to have a sequential row number.
If all rows have
or
> Dashboard?
>
> I did not find the network usage metric in it.
>
> Best,
> Phil
>
> On Mon, Jan 25, 2016 at 5:06 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> You can start a job and then periodically request and store information
>> about the run
; each HDFS file), right?
>
>
> On Fri, Jan 29, 2016 at 11:56 AM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> The number of input splits does not depend on the number of files but on
>> the number of HDFS blocks of all files.
>> Reading a single file with 100 HDFS
d then
> generate the InputSplits. Am I right? Or am I misunderstanding something?
>
> Best,
> Flavio
>
> On Fri, Jan 29, 2016 at 11:14 AM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Flavio,
>>
>> using a default FileOutputFormat, Flink writes one
Hi Flavio,
using a default FileOutputFormat, Flink writes one output file for each
data sink task, i.e., as many files as the defined parallelism.
The size of these files depends on the total output size and the
distribution. If you write to HDFS, a file consists of one or more HDFS
blocks.
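For example, to force a single output file, one can give the sink a
parallelism of 1 (sketch):

result.writeAsText("hdfs:///out")   // example path
      .setParallelism(1);           // one sink task -> one output file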
Hi Sana,
The feature you are looking for is called event time processing in Flink.
These blog posts should help you to become familiar with the concepts:
1) Event-Time concepts:
http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/
2) Windows in Flink:
Hi Emmanuel,
the feature you are looking for is called event time processing in Flink.
These blog posts should help you to become familiar with the concepts:
1) Event-Time concepts:
http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/
2) Windows in Flink:
Hi,
this is currently not supported yet. However, this feature is on our roadmap
and has been requested a few times.
So I hope somebody will pick it up soon.
If the static data set is small enough, you can read the full data set
(e.g., as a file) in the open method of FlatMapFunction, build a
Hi,
it is correct that the metrics are collected from the task managers.
In Flink 0.9.1 the metrics are visualized as charts in the web dashboard.
This visualization was removed when the dashboard was redesigned and
updated for 0.10, but will hopefully be added again.
For Flink 0.9.1, the
d reads. Also,
> the file is replicated across nodes and the reading (mapping) happens only
> once.
>
> Thank you,
> Saliya
>
> On Mon, Jan 25, 2016 at 4:38 AM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Saliya,
>>
>> yes that is possible, how
You can start a job and then periodically request and store information
about the running job and its vertices using the corresponding REST calls [1].
The data will be in JSON format.
After the job finished, you can stop requesting data.
Next you parse the JSON, extract the information you need and
Hi Saliya,
yes that is possible, however the requirements for reading a binary file
from the local FS are basically the same as for reading it from HDFS.
In order to be able to start reading different sections of a file in
parallel, you need to know the different starting positions. This can be
done
Hi Tal,
you said that most processing will be done in external processes. If these
processes are stateful, this might be hard to integrate with Flink's
fault-tolerance mechanism.
In principle, Flink requires two things to achieve exactly-once processing:
1) A data source that can be replayed from
@Christian: I don't think that is possible.
There are quite a few things missing in the JSON including:
- User function objects (Flink ships objects not class names)
- Function configuration objects
- Data types
Best, Fabian
2016-01-14 16:02 GMT+01:00 lofifnc :
> Hi