Re: Lost data when resuming from savepoint

2017-10-09 Thread Fabian Hueske
Hi Jose, I had a look at your program but did not spot anything. The query is a simple "SELECT FROM WHERE" query that does not have any state. So the only state is the state of the Kafka source, i.e., the offset. How much time passed between taking the savepoint and resuming? Did you see any

Re: Failing to recover once checkpoint fails

2017-10-09 Thread Fabian Hueske
the default true. I bring this up to see whether flink will in any >>>>>> circumstance drive consumption on the kafka perceived offset rather than >>>>>> the one in the checkpoint. >>>>>> >>>>>> * The state.backend

Re: Unusual log message - Emitter thread got interrupted

2017-10-06 Thread Fabian Hueske
Hi Ken, I don't have much experience with streaming iterations. Maybe Aljoscha (in CC) has an idea what is happening and if it can be prevented. Best, Fabian 2017-10-05 1:33 GMT+02:00 Ken Krugler : > Hi all, > > I’ve got a streaming topology with an iteration, and

Re: At end of complex parallel flow, how to force end step with parallel=1?

2017-10-06 Thread Fabian Hueske
e a local variable calculated by > ...distinct().count() in a downstream flow. The second flow indeed set > parallelism correctly! Thank you for the help. :) > > On Wed, Oct 4, 2017 at 8:01 AM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Hi Garrett, >> >>

Re: Calculating metrics real time

2017-10-05 Thread Fabian Hueske
com>: > Thanks for the response. So, i am guessing windows in flink will store the > records in memory before processing them. Correct? > > Rahul Raj > > On Oct 5, 2017 17:50, "Fabian Hueske" <fhue...@gmail.com> wrote: > >> Hi, >> >> I'd su

Re: Kafka Producer writeToKafkaWithTimestamps; name, uid, parallelism fails

2017-10-05 Thread Fabian Hueske
Hi Yunus, thanks for reporting this problem. I opened the JIRA issue FLINK-7764 [1] for this. Thanks, Fabian [1] https://issues.apache.org/jira/browse/FLINK-7764 2017-10-05 13:38 GMT+02:00 Yunus Olgun : > Hi, > > I am using Flink 1.3.2. When I try to use KafkaProducer with

Re: At end of complex parallel flow, how to force end step with parallel=1?

2017-10-04 Thread Fabian Hueske
Hi Garrett, that's strange. DataSet.reduceGroup() will create a non-parallel GroupReduce operator. So even without setting the parallelism manually to 1, the operator should not run in parallel. What might happen though is that a combiner is applied to locally reduce the data before it is shipped

Re: Clean GlobalWindow state

2017-09-29 Thread Fabian Hueske
Thanks for creating the JIRA issue! Best, Fabian 2017-09-20 12:26 GMT+02:00 gerardg : > I have prepared a repo that reproduces the issue: > https://github.com/GerardGarcia/flink-global-window-growing-state > > Maybe this way it is easier to spot the error or we can determine

Re: JDBC table source

2017-09-26 Thread Fabian Hueske
. Looks like that is not possible? > > On Tue, Sep 26, 2017 at 3:24 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Hi Mohit, >> >> no, a JdbcTableSource does not exist yet. However, since there is a >> JdbcInputFormat it should not be hard to wrap t

Re: JDBC table source

2017-09-26 Thread Fabian Hueske
Hi Mohit, no, a JdbcTableSource does not exist yet. However, since there is a JdbcInputFormat it should not be hard to wrap that in a TableSource. However, this would rather be a batch TableSource in the sense that it would just return the data that the query returns. Once all data is read it

Re: Classpath/ClassLoader issues

2017-09-20 Thread Fabian Hueske
Here's the pull request that hopefully fixes your issue: https://github.com/apache/flink/pull/4690 Best, Fabian 2017-09-20 16:15 GMT+02:00 Fabian Hueske <fhue...@gmail.com>: > Hi Garrett, > > I think I identified the problem. > You said you put the Hive/HCat dependencies into

Re: Classpath/ClassLoader issues

2017-09-20 Thread Fabian Hueske
loadClass(JavaUtils.java:78) > at org.apache.hadoop.hive.common.JavaUtils.loadClass(JavaUtils.java:74) > at org.apache.hive.hcatalog.mapreduce.FosterStorageHandler.<init>(FosterStorageHandler.java:68) > at org.apache.hive.hcatalog.common.HCatUtil.getStorageHandler(HCatUtil.java:404) > > > Thank you all for any help.

Re: Rest API cancel-with-savepoint: 404s when passing path as target-directory

2017-09-19 Thread Fabian Hueske
Hi Emily, thanks for reaching out. I'm not familiar with the details of the Rest API but Ufuk (in CC) might be able to help you. Best, Fabian 2017-09-19 10:23 GMT+02:00 Emily McMahon : > I've tried every combination I can think of to pass an s3 path as the > target

Re: Flink kafka consumer that read from two partitions in local mode

2017-09-19 Thread Fabian Hueske
Hi Tovi, your code looks OK to me. Maybe Gordon (in CC) has an idea what is going wrong. Just a side note: you don't need to set the parallelism to 2 to read from two partitions. A single consumer instance can read from multiple partitions. Best, Fabian 2017-09-19 17:02 GMT+02:00 Sofer,

Re: Classpath/ClassLoader issues

2017-09-19 Thread Fabian Hueske
Hi Garrett, Flink distinguishes between two classloaders: 1) the system classloader which is the main classloader of the process. This classloader loads all jars in the ./lib folder and 2) the user classloader which loads the job jar. AFAIK, the different operators do not have distinct

Re: Load distribution through the cluster

2017-09-19 Thread Fabian Hueske
There is no notion of "full" in Flink except that one slot will run at most one subtask of each operator. The scheduling depends on the structure of the job, the parallelism of the operators, and the number of slots per TM. It's hard to tell without knowing the details. 2017-09-19 11:57

Re: Clean GlobalWindow state

2017-09-19 Thread Fabian Hueske
If that were the case, it would be a bug in Flink. As I said before, your implementation looked good to me. All state of the window and trigger should be wiped if the trigger returns FIRE_AND_PURGE (or PURGE) and its clear() method is correctly implemented. I'll CC Aljoscha again for his

Re: Clean GlobalWindow state

2017-09-19 Thread Fabian Hueske
Hi Gerard, I had a look at your Trigger implementation but did not spot anything suspicious that would cause the state size to grow. However, I noticed a few things that can be improved: - use ctx.getCurrentProcessingTime instead of System.currentTimeMillis to make the Trigger easier to test

Re: Taskmanager unable to rejoin job manager

2017-09-18 Thread Fabian Hueske
Hi Marcus, thanks for reaching out with your problem. I'm not very experienced with the HA setup, but Till (in CC) might be able to help you. Best, Fabian 2017-09-14 16:57 GMT+02:00 Marcus Clendenin : > Hi all, > > > > I am having an issue where one of our task managers

Re: Securing Flink Monitoring REST API

2017-09-18 Thread Fabian Hueske
Hi, sorry for the late response. Flink uses Netty for network communication which supports SSL client authentication. I haven't tried it myself, but would think that this should work in Flink as well if you configure the certificates correctly. We should update the docs to cover this aspect.

Re: ARIMA Model

2017-09-18 Thread Fabian Hueske
Hi, no, such a model is not available. Best, Fabian 2017-09-17 20:01 GMT+02:00 Madhukar Thota : > is there an ARIMA (Autoregressive integrated moving average) module available in Flink machine

Re: Streaming API has a long delay at the beginning of the process.

2017-09-15 Thread Fabian Hueske
knows more about the lifecycle of tasks. > Thank you very much for your kindness. > > Regards, Yuta > > On 2017/09/14 19:32, Fabian Hueske wrote: > >> Hi, >> >> If I understand you correctly, the problem is only for the first events >> that are processed. &g

Re: Table API and registration of DataSet/DataStream

2017-09-14 Thread Fabian Hueske
; > > Is this correct? > Moreover, fromDataset requires fieldNames to be a comma separated String, > why not support also fieldNames as String[]...? > > Best, > Flavio > > > On Thu, Sep 14, 2017 at 3:43 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >

Re: Table API and registration of DataSet/DataStream

2017-09-14 Thread Fabian Hueske
<pomperma...@okkam.it>: > Yes I can do that of course. > What I need is basically the possibility to translate a where clause to a > filter function. Is there any utility class that does that in Flink? > > On 9 Sep 2017 21:54, "Fabian Hueske" <fhue...@gmail.com>

Re: ETL with changing reference data

2017-09-14 Thread Fabian Hueske
Hi Peter, in principle, joining the data stream with the reference data would be the most natural approach to enrich data. However, stream joins for the Table API / SQL are currently under development and not available yet. You can of course try to solve the issue manually using UDFs but this

Re: DataSet: partitionByHash without materializing/spilling the entire partition?

2017-09-14 Thread Fabian Hueske
.reduceGroup(new FirstReducer<>(3)) > > > Again, thank you very much for your support! > > Best, > Urs > > -- > Urs Schönenberger - urs.schoenenber...@tngtech.com > > TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring > Geschäftsführer:

Re: Streaming API has a long delay at the beginning of the process.

2017-09-14 Thread Fabian Hueske
Hi, If I understand you correctly, the problem is only for the first events that are processed. AFAIK, Flink lazily instantiates its operators which means that a source task starts to consume records from Kafka before the subsequent tasks have been started. That's why the latency of the first

Re: heap dump shows StoppableSourceStreamTask retained by java.lang.finalizer

2017-09-14 Thread Fabian Hueske
Hi Steven, thanks for reporting this issue. Looping in Till who's more familiar with the task lifecycles. Thanks, Fabian 2017-09-12 7:08 GMT+02:00 Steven Wu : > Hi , > > I was using Chaos Monkey to test Flink's behavior against frequent killing > of task manager nodes. I

Re: Keyed function type erasure problem.

2017-09-14 Thread Fabian Hueske
Hi, The problem is that Flink cannot determine the key type produced by the KeySelector because the type is a generic. The type information is lost at runtime due to Java's type erasure. You can try to implement the ResultTypeQueryable interface with the KeySelector and tell Flink the key type
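
A minimal sketch of that approach; MyEvent and its getKey() accessor are hypothetical, the key type is passed in explicitly since erasure removes it:

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.typeutils.ResultTypeQueryable;

public class GenericKeySelector<K> implements KeySelector<MyEvent<K>, K>, ResultTypeQueryable<K> {

    private final TypeInformation<K> keyType;

    public GenericKeySelector(TypeInformation<K> keyType) {
        this.keyType = keyType; // caller supplies the key type explicitly
    }

    @Override
    public K getKey(MyEvent<K> event) {
        return event.getKey();
    }

    @Override
    public TypeInformation<K> getProducedType() {
        return keyType; // tells Flink the key type it cannot extract itself
    }
}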

Re: Table API and registration of DataSet/DataStream

2017-09-09 Thread Fabian Hueske
Hi Flavio, I tried to follow your example. If I got it right, you would like to change the registered table by assigning a different DataStream to the original myDs variable. With registerDataStream("test", myDs, ...) you don't register the variable myDs as a table but its current value, i.e.,

Re: Fwd: HA : My job didn't restart even if task manager restarted.

2017-09-08 Thread Fabian Hueske
Hi, sorry for the late response! I'm not familiar with the details of the failure recovery but Till (in CC) knows the code in depth. Maybe he can figure out what's going on. Best, Fabian 2017-09-06 5:35 GMT+02:00 sunny yun : > I am still struggling to solve this problem.

Re: Flink Job Deployment (Not enough resources)

2017-09-08 Thread Fabian Hueske
Hi Rinat, No, Flink does not have a switch to immediately cancel a job if it cannot allocate enough resources. Maybe YARN has a configuration parameter to define a timeout after which a job is canceled if no resources become available. 2017-09-04 13:29 GMT+02:00 Rinat :

Re: Flink Job Deployment

2017-09-08 Thread Fabian Hueske
Hi Rinat, no, this is unfortunately not possible. When a job is submitted, all required JARs are copied into an HDFS location that's job-specific. Best, Fabian 2017-09-04 13:11 GMT+02:00 Rinat : > Hi folks ! > I’ve got a question about running flink job on the top of

Re: Does Flink has a JDBC server where I can submit Calcite Streaming Queries?

2017-09-08 Thread Fabian Hueske
ia REST or > whatever mechanism? Forgot even, the result set once I know there is a way > to submit Adhoc queries I can figure out a way to write to Kafka. > > Thanks, > kant > > On Fri, Sep 8, 2017 at 12:52 AM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Hi Ka

Re: Does Flink has a JDBC server where I can submit Calcite Streaming Queries?

2017-09-08 Thread Fabian Hueske
Hi Kant, no, there is no such functionality. I'm also not sure how well streaming would work together with the JDBC interface. JDBC has not been designed for continuous streaming queries, i.e., queries that never terminate. Challenges would be to have an infinite, streamable ResultSet (which

Re: State Maintenance

2017-09-08 Thread Fabian Hueske
instances. > > Thanks. > > On Thu, Sep 7, 2017 at 12:44 AM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Hi Navneeth, >> >> there's a lower level state interface that should address your >> requirements: OperatorStateStore.getUnionListState() >>

Re: Additional data read inside dataset transformations

2017-09-07 Thread Fabian Hueske
Hi, traditionally, you would do a join, but that would mean to read all Parquet files that might contain relevant data, which might be too much. If you want to read data from within a user function (like GroupReduce), you are pretty much on your own. You could create a HadoopInputFormat

Re: State Maintenance

2017-09-07 Thread Fabian Hueske
Hi Navneeth, there's a lower level state interface that should address your requirements: OperatorStateStore.getUnionListState() This union list state is similar to the regular operator list state, but instead of splitting the list for recovery and giving out splits to the operator instances, it
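
A minimal sketch of that interface via CheckpointedFunction; the state name and element type are chosen for illustration:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

public class UnionStateMapper extends RichMapFunction<String, String>
        implements CheckpointedFunction {

    private transient ListState<Long> unionState;

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        // union list state: on recovery, EVERY parallel instance gets the complete list
        unionState = context.getOperatorStateStore().getUnionListState(
                new ListStateDescriptor<>("my-union-state", Long.class));
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // each instance writes its own entries before the checkpoint is taken
    }

    @Override
    public String map(String value) {
        return value;
    }
}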

Re: Apache Phoenix integration

2017-09-06 Thread Fabian Hueske
Hi, According to the JavaDocs of java.sql.Connection, commit() will throw an exception if the connection is in auto commit mode which should be the default. So adding this change to the JdbcOutputFormat seems a bit risky. Maybe the Phoenix JDBC connector does not enable auto commits by default

Re: DataSet: partitionByHash without materializing/spilling the entire partition?

2017-09-06 Thread Fabian Hueske
anching and merging again? Maybe you can share the JSON plan (ExecutionEnvironment.getExecutionPlan())? Thanks, Fabian 2017-09-06 14:41 GMT+02:00 Fabian Hueske <fhue...@gmail.com>: > btw. not sure if you know that you can visualize the JSON plan returned by > ExecutionEnvironment.getExecutionPla

Re: DataSet: partitionByHash without materializing/spilling the entire partition?

2017-09-06 Thread Fabian Hueske
btw. not sure if you know that you can visualize the JSON plan returned by ExecutionEnvironment.getExecutionPlan() on the website [1]. Best, Fabian [1] http://flink.apache.org/visualizer/ 2017-09-06 14:39 GMT+02:00 Fabian Hueske <fhue...@gmail.com>: > Hi Urs, > > a hash-par
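
Obtaining the plan is a one-liner; a sketch:

import org.apache.flink.api.java.ExecutionEnvironment;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// ... define the DataSet program ...
// print the JSON plan and paste it into http://flink.apache.org/visualizer/
System.out.println(env.getExecutionPlan());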

Re: DataSet: partitionByHash without materializing/spilling the entire partition?

2017-09-06 Thread Fabian Hueske
Hi Urs, a hash-partition operator should not spill. In general, DataSet plans aim to be pipelined as much as possible. There are a few cases in which spilling happens: - full sort with insufficient memory - hash-tables that need to spill (only in join operators) - range partitioning to compute a

Re: Union limit

2017-09-06 Thread Fabian Hueske
de segment: > listOfDataSet > .sliding(60,60) > .map(dsg => dsg.reduce((ds1,ds2) => ds1.union(ds2)).map(new IdMapper())) > //There is an iterator of DataSet > .reduce((dsg1,dsg2) => dsg1.union(dsg2)) // Here I got the exception > .map(finalDataSet => ... some transformat

Re: count + aggragation

2017-09-04 Thread Fabian Hueske
Hi Alieh, I'm not aware of a solution to the first problem, but for the second issue you should use maxBy() instead of max(). Best, Fabian 2017-09-04 16:08 GMT+02:00 Alieh : > Hello all, > > 1st question: > Is there any way to know the count or the content of

Re: Securing Flink Monitoring REST API

2017-09-04 Thread Fabian Hueske
Hi, you can configure SSL for Flink's network communication [1] (see jobmanager.web.ssl.enabled). However, Flink does not manage different user accounts or allow granting permissions yet. Best, Fabian [1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/security-ssl.html

Re: How to fill flink's datastream

2017-09-04 Thread Fabian Hueske
Hi Andrea, a MapFunction calls its map() function for each stream element and returns exactly one result value. MapFunctions are used for 1-to-1 transformations. The returns() method allows to specify the return type of an operator, in your case the MapOperator. It is only necessary if Flink
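
A small sketch of both points; the input stream is hypothetical, and returns() declares the result type that the lambda's type erasure hides:

import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;

DataStream<String> words = ...; // hypothetical input
// map() is 1-to-1: exactly one output record per input record
DataStream<Tuple2<String, Integer>> lengths = words
        .map(w -> Tuple2.of(w, w.length()))
        .returns(new TypeHint<Tuple2<String, Integer>>() {});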

Re: Distributed reading and parsing of protobuf files from S3 in Apache Flink

2017-09-04 Thread Fabian Hueske
Hi, readFile() expects a FileInputFormat, i.e., your custom InputFormat would need to extend FileInputFormat. In general, any InputFormat decides what to read when generating InputSplits. In your case, the createInputSplits() method should return one InputSplit for each file it wants to
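
A hedged sketch of such a createInputSplits(); listProtoFiles() is a hypothetical helper that enumerates the files to read:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.flink.core.fs.Path;

@Override
public FileInputSplit[] createInputSplits(int minNumSplits) throws IOException {
    List<FileInputSplit> splits = new ArrayList<>();
    int num = 0;
    for (Path file : listProtoFiles()) { // hypothetical helper
        // one split per file; length -1 = read the whole file, no parallelism within a file
        splits.add(new FileInputSplit(num++, file, 0, -1, null));
    }
    return splits.toArray(new FileInputSplit[0]);
}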

Re: Help with table UDF

2017-08-31 Thread Fabian Hueske
Hi Flavio, you're not using the TableFunction correctly. The documentation shows how to call it in a join() method. But I agree, the error message should be better. Best, Fabian 2017-08-31 18:53 GMT+02:00 Flavio Pompermaier : > Hi all, > I'm using Flink 1.3.1 and I'm

Re: BlobCache and its functioning

2017-08-31 Thread Fabian Hueske
Hi Federico, Not sure what's going on there but Nico (in CC) is more familiar with the blob cache and might be able to help. Best, Fabian 2017-08-30 15:35 GMT+02:00 Federico D'Ambrosio : > Hi, > > I have a rather simple Flink job which has a KinesisConsumer as a source >

Re: Distributed reading and parsing of protobuf files from S3 in Apache Flink

2017-08-31 Thread Fabian Hueske
Hi, this is a valid approach. It might suffer from unbalanced load if the reader tasks process the files at different speeds (or the files vary in size) because each task has to process the same number of files. An alternative would be to implement your own InputFormat. The input format would

Re: Twitter example

2017-08-30 Thread Fabian Hueske
> setup. I believe you are referring to the flink log files > > Sent from my iPhone > > On Aug 30, 2017, at 12:56 AM, Fabian Hueske <fhue...@gmail.com> wrote:

Re: Union limit

2017-08-30 Thread Fabian Hueske
Hi b0c1, This is a limitation in Flink's optimizer. Internally, all binary unions are merged into a single n-ary union. The optimizer restricts the number of inputs for an operator to 64. You can work around this limitation with an identity mapper which prevents the union operators from
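
A sketch of such an identity mapper, inserted between unions so no single n-ary union exceeds the 64-input limit:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;

public static <T> DataSet<T> breakUnion(DataSet<T> ds) {
    return ds
            .map(new MapFunction<T, T>() {
                @Override
                public T map(T value) {
                    return value; // identity, only splits the union chain
                }
            })
            .returns(ds.getType()); // generic T needs an explicit result type
}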

Re: EOFException related to memory segments during run of Beam pipeline on Flink

2017-08-30 Thread Fabian Hueske
Hi Reinier, this is in fact a bug that you stumbled upon. In general, Flink works very well with larger data sets and little memory and gracefully spills data to disk. The problem in your case is caused by a wrapped exception. Internally, Flink uses an EOFException to signal that the memory pool

Re: Elasticsearch Sink - Error

2017-08-30 Thread Fabian Hueske
That's correct Flavio. The issue has been reported as https://issues.apache.org/jira/browse/FLINK-7386 Best, Fabian 2017-08-30 9:21 GMT+02:00 Flavio Pompermaier : > I also had problems with ES 5.4.3 and I had to modify the connector > code...I fear that the code is

Re: Twitter example

2017-08-30 Thread Fabian Hueske
Hi, print() writes the data to the out files of the TaskManagers. So you need to go to the machine that runs the TM and check its out file which is located in the log folder. Best, Fabian 2017-08-29 23:53 GMT+02:00 Krishnanand Khambadkone : > I am trying to run the

Re: Experiencing long latency while using sockets

2017-08-10 Thread Fabian Hueske
e two sends is 0.07 ms at the sending side, > which is a java socket client. > > Or could it be that there is a timeout setting for scheduling data source > in Flink? > > > Thanks, > > Chao > > On 08/08/2017 02:58 AM, Fabian Hueske wrote: > > One pointer is the Stre

Re: Experiencing long latency while using sockets

2017-08-08 Thread Fabian Hueske
One pointer is the StreamExecutionEnvironment.setBufferTimeout() parameter. Flink's network stack collects records in buffers to send them over the network. A buffer is sent when it is completely filled or after a configurable timeout. So if your program does not process many records, these
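
For example, a sketch; a lower timeout means lower latency at the cost of more, smaller network messages:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// flush outgoing network buffers every 5 ms even if they are not full
env.setBufferTimeout(5);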

Re: [ANNOUNCE] Apache Flink 1.3.2 released

2017-08-07 Thread Fabian Hueske
Thanks Aljoscha and everybody who contributed with bug reports and fixes! Best, Fabian 2017-08-07 11:07 GMT+02:00 Till Rohrmann : > Thanks to the community and Aljoscha for the hard work to complete Flink > 1.3.2. > > Cheers, > Till > > On Sat, Aug 5, 2017 at 9:12 AM,

Re: Access Sliding window

2017-08-04 Thread Fabian Hueske
TimeWindow.getStart() or TimeWindow.getEnd() -> https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/windows.html#incremental-window-aggregation-with-reducefunction 2017-08-04 22:43 GMT+02:00 Raj Kumar : > Thanks Fabian. > > The incoming events have the
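
A sketch of a WindowFunction that attaches the window bounds to a pre-aggregated value; the Long input and tuple output types are chosen for illustration:

import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class EmitWithBounds
        implements WindowFunction<Long, Tuple3<Long, Long, Long>, String, TimeWindow> {

    @Override
    public void apply(String key, TimeWindow window, Iterable<Long> input,
                      Collector<Tuple3<Long, Long, Long>> out) {
        // with incremental (reduce) aggregation the iterable holds a single value
        Long agg = input.iterator().next();
        out.collect(Tuple3.of(window.getStart(), window.getEnd(), agg));
    }
}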

Re: Proper way to establish bucket counts

2017-08-04 Thread Fabian Hueske
ooks like something that could be used to count the total events in > the overall sink, but doesn't look like it works on a per-bucket basis. > We're going to try adding a counter to the bucket status object. > > On Wed, Aug 2, 2017 at 3:02 AM, Fabian Hueske <fhue...@gmail.co

Re: Manually controlling time for integration test

2017-08-04 Thread Fabian Hueske
Hi Maxim, you could inject an AssignerWithPunctuatedWatermarks into your plan which emits a watermark for every record it sees. That way you can increment the logical time for every record. Best, Fabian 2017-08-04 16:27 GMT+02:00 Maksym Parkachov : > Hi, > > I'm
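
A minimal sketch of such an assigner; MyEvent and its timestamp accessor are hypothetical:

import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

public class PerRecordWatermarks implements AssignerWithPunctuatedWatermarks<MyEvent> {

    @Override
    public long extractTimestamp(MyEvent element, long previousElementTimestamp) {
        return element.getTimestamp();
    }

    @Override
    public Watermark checkAndGetNextWatermark(MyEvent lastElement, long extractedTimestamp) {
        // a non-null watermark after every record -> logical time advances per record
        return new Watermark(extractedTimestamp);
    }
}

// usage, roughly: stream.assignTimestampsAndWatermarks(new PerRecordWatermarks())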

Re: Access Sliding window

2017-08-04 Thread Fabian Hueske
Hi Raj, you have to combine two streams. The first stream has the running avg + std-dev over the last 6 hours, the second stream has the 15 minute counts. Both streams emit one record every 15 minutes. What you want to do is to join the two records of both streams with the same timestamp. You do

Re: state inside functions

2017-08-04 Thread Fabian Hueske
Hi Peter, function objects (such as an instance of a class that extends MapFunction) that are used to construct a plan are serialized using Java serialization and shipped to the workers for execution. Therefore, function classes must be Serializable. In general it is recommended to configure
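
A common pattern, sketched below: mark non-serializable members transient and create them in open(); ExpensiveClient is a hypothetical dependency:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class EnrichMapper extends RichMapFunction<String, String> {

    // not Serializable -> transient, so it is not shipped with the function object
    private transient ExpensiveClient client;

    @Override
    public void open(Configuration parameters) {
        client = new ExpensiveClient(); // created on the worker, after deserialization
    }

    @Override
    public String map(String value) {
        return client.enrich(value);
    }
}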

Re: Proper way to establish bucket counts

2017-08-02 Thread Fabian Hueske
Hi Robert, Flink collects many metrics by default, including the number of records / events that go into each operator (see [1], System Metrics, IO, "numRecordsIn"). So, you would only need to access that metric. Best, Fabian [1]

Re: S3 Write Execption

2017-08-02 Thread Fabian Hueske
Hi Aneesha, the logs would show that Flink is going through a recovery cycle. Recovery means to cancel running tasks and start them again. If you don't see something like that in the logs, Flink continues processing. I'm not familiar with the details of S3, so I can't tell if the exception

Re: Odd flink behaviour

2017-08-02 Thread Fabian Hueske
:00 Mohit Anchlia <mohitanch...@gmail.com>: > Thanks that worked. However, what I don't understand is wouldn't the open > call that I am inheriting have this logic already inbuilt? I am inheriting > FileInputFormat. > > On Tue, Aug 1, 2017 at 1:42 AM, Fabian Hueske <

Re: SQL API with Table API

2017-08-01 Thread Fabian Hueske
No, those are two different queries. The second query would also not work because the revenue table does not have the EventTime attribute. Best, Fabian 2017-08-01 13:03 GMT+02:00 nragon : > Hi, > > Can i expect the output from this: > > Table revenue =

Re: Access Sliding window

2017-08-01 Thread Fabian Hueske
The average would be computed over the aggregated 15-minute count values. The sliding window would emit every 15 minutes the average of all records that arrived within the last 6 hours. Since the preceding 15-minute tumbling window emits 1 record every 15 mins, this would be the avg over 24

Re: Using FileInputFormat - org.apache.flink.streaming.api.functions.source.TimestampedFileInputSplit

2017-08-01 Thread Fabian Hueske
Hi Mohit, these are just INFO log statements that do not necessarily indicate a problem. Is the program working otherwise or do you observe other problems? Best, Fabian 2017-08-01 0:32 GMT+02:00 Mohit Anchlia : > I even tried existing format but still same error: > >

Re: Does Flink support long term storage and complex queries

2017-08-01 Thread Fabian Hueske
Hi Basanth, Flink is not a storage system (neither for stream nor for batch data). Flink applications can be stateful and maintain very large state (in the order of several TBs) but state is always associated with an application. State can be queryable, so outside applications can run key-lookup

Re: Odd flink behaviour

2017-08-01 Thread Fabian Hueske
1, 2017 at 9:49 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Do you set reached to false in open()? >> >> >> Am 01.08.2017 2:44 vorm. schrieb "Mohit Anchlia" <mohitanch...@gmail.com >> >: >> >> And here i

Re: KeyBy State

2017-08-01 Thread Fabian Hueske
Hi, there is no built-in support for key changes. You might be able to feed back a changed key with an iteration edge, but I'm not sure how well that works. Best, Fabian 2017-08-01 7:32 GMT+02:00 Govindarajan Srinivasaraghavan < govindragh...@gmail.com>: > Hi, > > I have a keyby state but the key

Re: Flatbuffers and Flink

2017-08-01 Thread Fabian Hueske
Hi, in principle you can use any data type with Flink including byte[]. However, all of your functions need the logic to interpret the bytes and you have to implement custom key extractors (if you need to keyBy or partition your stream). Best, Fabian 2017-08-01 2:09 GMT+02:00 Basanth Gowda
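
For example, a sketch of keying raw bytes; decodeUserId() is a hypothetical decoder for the serialized schema:

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;

DataStream<byte[]> raw = ...; // hypothetical source of serialized records
KeyedStream<byte[], String> keyed = raw.keyBy(new KeySelector<byte[], String>() {
    @Override
    public String getKey(byte[] bytes) {
        // the function itself must know how to interpret the bytes
        return decodeUserId(bytes);
    }
});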

Re: multiple streams with multiple actions - proper way?

2017-08-01 Thread Fabian Hueske
Hi Peter, this kind of use case is supported, but it is best practice to split independent pipelines into individual jobs. One reason for that is to isolate failures and restarts. For example, I would split the program you posted into two programs, one for the "foo" topic and one for the "bar"

Re: Odd flink behaviour

2017-07-31 Thread Fabian Hueske
Do you set reached to false in open()? Am 01.08.2017 2:44 vorm. schrieb "Mohit Anchlia" : And here is the inputformat code: public class PDFFileInputFormat extends FileInputFormat { /** * */ private static final long serialVersionUID = -4137283038479003711L;
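
If not, resetting it there should help; a sketch against the quoted code, where the reached field comes from the snippet above:

@Override
public void open(FileInputSplit split) throws IOException {
    super.open(split);
    // reset per split; otherwise reachedEnd() stays true after the first split
    reached = false;
}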

Re: Access Sliding window

2017-07-31 Thread Fabian Hueske
You can compute the average and std-dev in a WindowFunction that iterates over all records in the window (6h / 15min = 24). WindowFunction [1] and CoProcessFunction [2] are described in the docs. [1]

Re: Flink QueryableState with Sliding Window on RocksDB

2017-07-31 Thread Fabian Hueske
Having an operator that updates state from one stream and queries it to process the other stream is actually a common pattern. As I said, I don't know your use case but I don't think that a CoProcessFunction would result in a mess. QueryableState will have quite a bit of overhead because the

Re: Flink QueryableState with Sliding Window on RocksDB

2017-07-31 Thread Fabian Hueske
I am not sure that this is impossible, but it is not the use case queryable state was designed for. I don't know the details of your application, but you could try to merge the updating and the querying operators into a single one. You could connect two streams with connect() and use a keyed

Re: AW: Is watermark used by joining two streams

2017-07-31 Thread Fabian Hueske
currently using keyBy on a timestamp > common across the streams. > > Regards, > Vijay Raajaa GS > > On Mon, Jul 31, 2017 at 1:16 PM, Fabian Hueske <fhue...@gmail.com> wrote: > >> Hi, >> >> @Wei: You can implement very different behavior using a >>

Re: Flink QueryableState with Sliding Window on RocksDB

2017-07-31 Thread Fabian Hueske
Hi Biplob, given these requirements, I would rather not use a window but implement the functionality with a stateful ProcessFunction. A ProcessFunction can register timers, e.g., to remove inactive state. The state of a ProcessFunction can be made queryable. Best, Fabian 2017-07-31 9:52
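
A rough sketch of that, applied on a keyed stream; Txn is a hypothetical record type and the timer logic is simplified:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class LastTxnFunction extends ProcessFunction<Txn, Txn> {

    private transient ValueState<Txn> lastTxn;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Txn> desc = new ValueStateDescriptor<>("lastTxn", Txn.class);
        desc.setQueryable("lastTxn"); // exposes the state to the queryable state client
        lastTxn = getRuntimeContext().getState(desc);
    }

    @Override
    public void processElement(Txn txn, Context ctx, Collector<Txn> out) throws Exception {
        lastTxn.update(txn);
        // timer to expire inactive keys; a real implementation would also track the
        // last update time and only clear if the key actually stayed inactive
        ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + 60 * 60 * 1000L);
        out.collect(txn);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Txn> out) throws Exception {
        lastTxn.clear();
    }
}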

Re: AW: Is watermark used by joining two streams

2017-07-31 Thread Fabian Hueske
ot;wei" <jixia...@googlemail.com> wrote: > > Hello Fabian, > > > > thank you for your answer! > > > > Does it mean that the operator will wait until get two watermarks from the > input streams and emits then the “slower” watermark? > > > > Be

Re: Flink QueryableState with Sliding Window on RocksDB

2017-07-31 Thread Fabian Hueske
Hi Biplob, What do you mean by "creating a sliding window on top of a state"? Sliding windows are typically defined on streams (data in motion) and not on state (data at rest). It seems that UpdatedTxnState always holds the last record that was received per key. Do you want to compute the

Re: Custom inputformat

2017-07-31 Thread Fabian Hueske
ad all the files in a directory? >> - how to make an inputformat a streaming input instead of batch? Eg: read >> as new files come to a dir. >> >> Thanks again. >> >> On Sun, Jul 30, 2017 at 12:53 AM, Fabian Hueske <fhue...@gmail.com> >> wrote: >>

Re: Access Sliding window

2017-07-31 Thread Fabian Hueske
Hi, I would first compute the 15 minute counts. Based on these counts, you compute the threshold by computing average and std-dev and then you compare the counts with the threshold. In pseudo code this could look as follows: DataStream requests = ... DataStream counts = requests.timeWindow(15

Re: Is watermark used by joining two streams

2017-07-30 Thread Fabian Hueske
Periodic and punctuated watermarks only differ in the way that they are generated. Afterwards they are treated the same. An operator with two input streams will always sync its own watermarks to the watermarks of both input streams, i.e., to the "slower" watermark of both inputs. So if the left
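
Illustrated as pseudocode; this is just the rule, not the actual Flink internals:

// two-input operator: the operator's own watermark is the minimum of both inputs
void onWatermark1(long wm) { wm1 = wm; advance(); }
void onWatermark2(long wm) { wm2 = wm; advance(); }

void advance() {
    long combined = Math.min(wm1, wm2); // the "slower" input wins
    if (combined > current) {
        current = combined;
        emitWatermark(current);
    }
}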

Re: Reading byte[] from socket

2017-07-30 Thread Fabian Hueske
Hi Paolo, have a look at SocketTextStreamFunction [1] which is internally used when StreamExecutionEnviornment.socketTextStream() is called. Best, Fabian [1]

Re: Custom inputformat

2017-07-30 Thread Fabian Hueske
Hi, Flink calls the reachedEnd() method before it calls nextRecord() and closes the IF when reachedEnd() returns true. So, it should not return true until nextRecord() was called and the first and last record was emitted. You might also want to build your PDFFileInputFormat on FileInputFormat
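
Simplified, the runtime drives an InputFormat roughly like this (a sketch, not the actual source):

format.open(split);
while (!format.reachedEnd()) {        // checked before every record
    Record r = format.nextRecord(reuse);
    if (r != null) {
        collector.collect(r);
    }
}
format.close();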

Re: Questions

2017-07-27 Thread Fabian Hueske
Hi Egor, There is the Row type which is not strongly typed (such as TupleX) but supports arbitrary number of fields and null-valued fields. The DataSet API does not have a split operator and implementing this would be much more difficult than one would expect. The problem is in the optimizer

Re: Distributed reading and parsing of protobuf files from S3 in Apache Flink

2017-07-27 Thread Fabian Hueske
Hi, it depends on the file format whether a file can be read in parallel or not. Basically, you have to be able to identify valid offsets from which you can start reading. There are a few techniques like fixed sized blocks with padding or a footer section with split offsets, but if the file is

Re: Class not found when deserializing

2017-07-27 Thread Fabian Hueske
Hi Paolo, do you get the ClassNotFoundException for TheGlobalModel or for another class? Did you maybe forget to include SerializationUtils in the classpath? Best, Fabian 2017-07-26 16:14 GMT+02:00 Paolo Cristofanelli < cristofanelli.pa...@gmail.com>: > Hi, > > I am trying to write and read in

Re: How can I set charset for flink sql?

2017-07-26 Thread Fabian Hueske
As Timo proposed, I would implement a Scalar user-defined function which returns a boolean and use that instead of LIKE. Have a look here [1]. Best, Fabian [1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/table/udfs.html#scalar-functions 2017-07-26 3:47 GMT+02:00 Ted Yu
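
A sketch of such a function and its use; the names and the matching logic are placeholders:

import org.apache.flink.table.functions.ScalarFunction;

public class CharsetLike extends ScalarFunction {
    public boolean eval(String value, String pattern) {
        // custom, charset-aware matching logic goes here
        return value != null && value.contains(pattern);
    }
}

// registration and use, roughly:
// tableEnv.registerFunction("charsetLike", new CharsetLike());
// Table result = tableEnv.sql("SELECT * FROM T WHERE charsetLike(name, 'foo')");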

Re: Connecting to remote task manager failed

2017-07-26 Thread Fabian Hueske
Hi, Do you have an exception for the CoGroup failure? Best, Fabian 2017-07-26 3:32 GMT+02:00 Charith Wickramarachchi < charith.dhanus...@gmail.com>: > I did some more digging. It seems the CoGroup operation failed in one of > the workers. But I do not face this issue when running other

Re: Flink shaded table API

2017-07-26 Thread Fabian Hueske
Hi, I tried to reproduce the problem, but did not get the exception you posted. I implemented a job similar to yours and ran it both from the IDE and using the flink submission client using the 1.3.1 binaries. In both cases, the schema was correctly printed. Do you get the exception in the IDE

Re: How can i merge more than one flink stream

2017-07-25 Thread Fabian Hueske
> * connect - you have to provide a CoProcessFunction. >>> >>> * window join/cogroup - you provide key selector functions, a time >>> window and a join/cogroup function. >>> >>> With the first method, you have to write more code, in exchange f

Re: Count Different Codes in a Window

2017-07-24 Thread Fabian Hueske
Hi Raj, You can use ReduceFunction in combination with a WindowFunction [1]. Best, Fabian [1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/windows.html#windowfunction-with-incremental-aggregation 2017-07-24 20:31 GMT+02:00 Raj Kumar : > Thanks
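
A sketch of the incremental part; the (code, count) tuple layout is chosen for illustration:

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;

// eagerly merges two (code, count) records as elements arrive, so the window
// only ever stores one element per key instead of all raw records
public class CountReducer implements ReduceFunction<Tuple2<String, Long>> {
    @Override
    public Tuple2<String, Long> reduce(Tuple2<String, Long> a, Tuple2<String, Long> b) {
        return Tuple2.of(a.f0, a.f1 + b.f1);
    }
}

// usage, roughly:
// counts.keyBy(0).timeWindow(Time.minutes(15)).reduce(new CountReducer());
// or .apply(new CountReducer(), new MyWindowFunction()) to also get window metadata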

Re: Count Different Codes in a Window

2017-07-24 Thread Fabian Hueske
Hi Raj, I would recommend to use a ReduceFunction instead of a WindowFunction. The benefit of ReduceFunction is that it can be eagerly computed whenever an element is put into the window such that the state of the window is only one element. In contrast, the WindowFunction collects all elements

Re: Find the running median from a data stream

2017-07-24 Thread Fabian Hueske
Hi Gabriele, I don't think you can compute the exact running median on a stream. This would require to collect all elements of the stream so you would basically need to put the complete stream into the ValueState. Even if the state is backed by RocksDB, the state for a specific key needs to fit

Re: Flink ML with DataStream

2017-07-21 Thread Fabian Hueske
I. > > It would be great if we could produce something reusable for the community. > > > > > > *From:* Fabian Hueske [mailto:fhue...@gmail.com] > *Sent:* Wednesday, July 19, 2017 2:12 PM > *To:* Branham, Jeremy [IT] <jeremy.d.bran...@sprint.com> > *Cc:* user@flink.

Re: FlinkKafkaConsumer010 - Memory Issue

2017-07-19 Thread Fabian Hueske
Hi, Gordon (in CC) knows the details of Flink's Kafka consumer. He might know how to solve this issue. Best, Fabian 2017-07-19 20:23 GMT+02:00 PedroMrChaves : > Hello, > > Whenever I submit a job to Flink that retrieves data from Kafka the memory > consumption

Re: Flink ML with DataStream

2017-07-19 Thread Fabian Hueske
Hi, unfortunately, it is not possible to convert a DataStream into a DataSet. Flink's DataSet and DataStream APIs are distinct APIs that cannot be used together. The FlinkML library is only available for the DataSet API. There is some ongoing work to add a machine learning library for streaming
