Re: Multiple select single result

2019-01-14 Thread Fabian Hueske
"datastreamcalcrule" grows beyond 64 kb > > Cheers > Dhanuka > > > On Mon, 14 Jan 2019, 20:30 Fabian Hueske >> Hi, >> >> you should avoid the UNION ALL approach because the query will scan the >> (identical?) Kafka topic 200 times which is highly ine

Re: Multiple select single result

2019-01-14 Thread Fabian Hueske
t;> On Mon, Jan 14, 2019 at 9:54 AM dhanuka ranasinghe < >>> dhanuka.priyan...@gmail.com> wrote: >>> >>>> Hi Fabian, >>>> >>>> Thanks for the prompt reply and its working 🤗. >>>> >>>> I am trying to depl

Re: Multiple select single result

2019-01-13 Thread Fabian Hueske
Hi Dhanuka, The important error message here is "AppendStreamTableSink requires that Table has only insert changes". This is because you use UNION instead of UNION ALL, which implies duplicate elimination. Unfortunately, UNION is currently internally implemented as a regular aggregration which pro

Re: Reducing runtime of Flink planner

2019-01-10 Thread Fabian Hueske
Hi Niklas, The planning time of a job does not depend on the data size. It would be the same whether you process 5MB or 5PB. FLINK-10566 (as pointed to by Timo) fixed a problem for plans with many braching and joining nodes. Looking at your plan, there are some, but (IMO) not enough to be problem

Re: [DISCUSS] Dropping flink-storm?

2019-01-10 Thread Fabian Hueske
>> > >> https://github.com/apache/flink/blob/master/flink-contrib/flink-storm/src/main/java/org/apache/flink/storm/wrappers/FlinkTopologyContext.java >> > >> > Cheers, >> > Till >> > >> > On Tue, Oct 9, 2018 at 10:08 AM Fabia

Re: Flink SQL client always cost 2 minutes to submit job to a local cluster

2019-01-03 Thread Fabian Hueske
Hi, You can try to build a JAR file with all runtime dependencies of Flink SQL (Calcite, Janino, +transitive dependencies), add it to the lib folder, and exclude the dependencies from the JAR file that is sent to the cluster when you submit a job. It would also be good to figure out what takes so

[ANNOUNCE] Berlin Buzzwords 2019 - Call for Presentations open until Febr. 17th

2018-12-21 Thread Fabian Hueske
Hi everyone, The Call for Presentations for the Berlin Buzzwords 2019 conference is open until Febr. 17th. Berlin Buzzwords [1] is an amazing conference on all things Scale, Search, and Stream(!) with a great and open minded community and a strong focus on open source. Next year's edition takes p

Re: Need the way to create custom metrics

2018-12-18 Thread Fabian Hueske
Hi, AFAIK it is not possible to collect metrics for an AggregateFunction. You can open a feature request by creating a Jira issue. Thanks, Fabian Am Mo., 17. Dez. 2018 um 21:29 Uhr schrieb Gaurav Luthra < gauravluthra6...@gmail.com>: > Hi, > > I need to know the way to implement custom metrics

[ANNOUNCE] Call for Presentations Extended for Flink Forward San Francisco 2019

2018-12-01 Thread Fabian Hueske
Flink Forward San Francisco Am Di., 20. Nov. 2018 um 17:57 Uhr schrieb Fabian Hueske : > Hi Everyone, > > Flink Forward San Francisco will *take place on April 1st and 2nd 2019*. > Flink Forward is a community conference organized by data Artisans and > gathers many members of the

Re: How to apply watermark on datastream and then do join operation on it

2018-11-30 Thread Fabian Hueske
Hi, Welcome to the mailing list. What exactly is your problem? Do you receive an error message? Is the program not compiling? Do you receive no output? Regardless of that, I would recommend to provide the timestamp extractors to the Kafka source functions. Also, I would have a close look at the w

Re: [Table API example] Table program cannot be compiled. This is a bug. Please file an issue

2018-11-30 Thread Fabian Hueske
Hi Marvin, Can you post the query (+ schema of tables) that lead to this exception? Thank you, Fabian Am Fr., 30. Nov. 2018 um 10:55 Uhr schrieb Marvin777 < xymaqingxiang...@gmail.com>: > Hi all, > > I have a simple test for looking at Flink Table API and hit an exception > reported as a bug.

Re: Looking for relevant sources related to connecting Apache Flink and Edgent.

2018-11-30 Thread Fabian Hueske
Hi Felipe, You can define TableSources (for SQL, Table API) that support filter push-down. The optimizer will figure out this opportunity and hand filters to a custom TableSource. I should add that AFAIK this feature is not used very often (expect some rough edges) and that the API is likely to c

Re: Dataset column statistics

2018-11-29 Thread Fabian Hueske
ve or > Spark) in Flink Table API? > > Best, > Flavio > > On Thu, Nov 29, 2018 at 9:45 AM Fabian Hueske wrote: > >> Hi, >> >> You could try to enable object reuse. >> Alternatively you can give more heap memory or fine tune the GC >> parameters. &

Re: Dataset column statistics

2018-11-29 Thread Fabian Hueske
Hi, You could try to enable object reuse. Alternatively you can give more heap memory or fine tune the GC parameters. I would not consider it a bug in Flink, but might be something that could be improved. Fabian Am Mi., 28. Nov. 2018 um 18:19 Uhr schrieb Flavio Pompermaier < pomperma...@okkam.

Re: OutOfMemoryError while doing join operation in flink

2018-11-28 Thread Fabian Hueske
Hi, Flink handles large data volumes quite well, large records are a bit more tricky to tune. You could try to reduce the number of parallel tasks per machine (#slots per TM, #TMs per machine) and/or increase the amount of available JVM memory (possible in exchange for managed memory as Zhijiang s

Re: your advice please regarding state

2018-11-27 Thread Fabian Hueske
Hi Avi, I'd definitely go for approach #1. Flink will hash partition the records across all nodes. This is basically the same as a distributed key-value store sharding keys. I would not try to fine tune the partitioning. You should try to use as many keys as possible to ensure an even distribution

Re: understadning kafka connector - rebalance

2018-11-26 Thread Fabian Hueske
Hi, DataStream x = ... x.rebalance().keyBy() is not a good idea. It will first distribute the records round-robin (over the network) and subsequently partition them by hash. The first shuffle is unnecessary. It does not have any effect because it is undone by the second partitioning. Btw. any m

Re: Joining more than 2 streams

2018-11-26 Thread Fabian Hueske
Hi, Yes, your reasoning is correct. If you use two binary joins, the data of the first two streams will be buffered twice. Unioning all three streams and joining them in a custom ProcessFunction would reduce the amount of required state. Best, Fabian Am Sa., 24. Nov. 2018 um 14:08 Uhr schrieb Ga

Re: CEP Dynamic Patterns

2018-11-26 Thread Fabian Hueske
Hi Steve, No this feature has not been contributed yet. Best, Fabian Am Fr., 23. Nov. 2018 um 20:58 Uhr schrieb Steve Bistline < srbistline.t...@gmail.com>: > Have dynamic patterns been introduced yet? > > Steve >

Re: Flink JSON (string) to Pojo (and vice versa) example

2018-11-22 Thread Fabian Hueske
Thanks Flavio. This looks very useful. AFAIK, the Calcite community is also working on functions for JSON <-> Text conversions which are part of SQL:2016. Hopefully, we can leverage their implementations in Flink's SQL support. Best, Fabian Am Di., 20. Nov. 2018 um 18:27 Uhr schrieb Flavio Pompe

Re: Deadlock happens when sink to mysql

2018-11-22 Thread Fabian Hueske
Hi, Which TableSource and TableSink do you use? Best, Fabian Am Mo., 19. Nov. 2018 um 15:39 Uhr schrieb miki haiat : > can you share your entire code please > > On Mon, Nov 19, 2018 at 4:03 PM 徐涛 wrote: > >> Hi Experts, >> I use the following sql, and sink to mysql, >> select >> >> alb

Re: Flink streaming automatic scaling (1.6.1)

2018-11-21 Thread Fabian Hueske
Hi, Flink 1.6 does not support automatic scaling. However, there is a REST call to trigger the rescaling of a job. You need to call it manually though. Have a look at the */jobs/:jobid/rescaling *call in the REST docs [1]. Best, Fabian [1] https://ci.apache.org/projects/flink/flink-docs-release-

Re: Can JDBCSinkFunction support exectly once?

2018-11-21 Thread Fabian Hueske
Hi, JDBCSinkFunction is a simple wrapper around the JDBCOutputFormat (the DataSet / Batch API output interface). Dominik is right, that JDBCSinkFunction does not support exactly-once output. It is not strictly required that an exactly-once sink implements TwoPhaseCommitFunction. TPCF is a conveni

Re: TaskManager & task slots

2018-11-21 Thread Fabian Hueske
Yes, this hasn't changed. Best, Fabain Am Mi., 21. Nov. 2018 um 08:18 Uhr schrieb yinhua.dai < yinhua.2...@outlook.com>: > Hi Fabian, > > Is below description still remain the same in Flink 1.6? > > Slots do not guard CPU time, IO, or JVM memory. At the moment they only > isolate managed memory

[ANNOUNCE] Flink Forward San Francisco Call for Presentations closes soon

2018-11-20 Thread Fabian Hueske
Hi Everyone, Flink Forward San Francisco will *take place on April 1st and 2nd 2019*. Flink Forward is a community conference organized by data Artisans and gathers many members of the Flink community, including users, contributors, and committers. It is the perfect event to get in touch and conne

Re: Flink Table Duplicate Evaluation

2018-11-20 Thread Fabian Hueske
Hi Niklas, The workaround that you described should work fine. However, you don't need a custom sink. Converting the Table into a DataSet and registering the DataSet again as a Table is currently the way to solve this issue. Best, Fabian Am Di., 20. Nov. 2018 um 17:13 Uhr schrieb Niklas Teichman

Re: Group by with null keys

2018-11-20 Thread Fabian Hueske
Hi Flavio, Whether groupBy with null values works or not depends on the type of the key, or more specifically on the TypeComparator and TypeSerializer that are used to serialize, compare, and hash the key type. The processing engine supports null values If the comparator and serializer can handle

Re: Flink Serialization and case class fields limit

2018-11-16 Thread Fabian Hueske
Hi Andrea, I wrote the post 2.5 years ago. Sorry, I don't think that I kept the code somewhere, but the mechanics in Flink should still be the same. Best, Fabian Am Fr., 16. Nov. 2018, 20:06 hat Andrea Sella geschrieben: > Hi Andrey, > > My bad, I forgot to say that I am using Scala 2.11, that

Re: Stopping a streaming app from its own code : behaviour change from 1.3 to 1.6

2018-11-09 Thread Fabian Hueske
Hi Arnaud, Thanks for reporting the issue! Best, Fabian Am Do., 8. Nov. 2018 um 20:50 Uhr schrieb LINZ, Arnaud < al...@bouyguestelecom.fr>: > 1.FLINK-10832 > > Created (with heavy difficulties as typing java code in a jira description > wa

Re: Always trigger calculation of a tumble window in Flink SQL

2018-11-09 Thread Fabian Hueske
Hi, SQL does not support any custom triggers or timers. In general, computations are performed when they are complete with respect to the watermarks (applies for GROUP BY windows, OVER windows, windowed and time-versioned joins, etc. Best, Fabian Am Fr., 9. Nov. 2018 um 05:08 Uhr schrieb yinhua.

Re: Manually clean SQL keyed state

2018-11-09 Thread Fabian Hueske
Hi Shahar, That's not possible at the moment. The SQL API does not provide any knobs to control state size besides the idle state retention. The reason is that it aims to be as accurate as possible. In the future it might be possible to provide more information to the system (like constraints in

Re: Counting DataSet in DataFlow

2018-11-07 Thread Fabian Hueske
r(if cnt != 0).withBroadcastSet("cnt", count).doSomethingElse Fabian [1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/batch/#broadcast-variables Am Mi., 7. Nov. 2018 um 14:10 Uhr schrieb Fabian Hueske : > Hi, > > Counting always requires a job to be executed. > N

Re: Counting DataSet in DataFlow

2018-11-07 Thread Fabian Hueske
Hi, Counting always requires a job to be executed. Not sure if this is what you want to do, but if you want to prevent to get an empty result due to an empty cross input, you can use a mapPartition() with parallelism 1 to emit a special record, in case the MapPartitionFunction didn't see any data.

Re: Split one dataset into multiple

2018-11-06 Thread Fabian Hueske
You have to define a common type, like an n-ary Either type and return that from your source / operator. The resulting DataSet can be consumed by multiple FlatmapFunctions, each extracting and forwarding one of the the result types. Cheers, Fabian Am Di., 6. Nov. 2018 um 10:40 Uhr schrieb madan :

Re: Non deterministic result with Table API SQL

2018-11-05 Thread Fabian Hueske
Thanks Flavio for reporting the error helping to debug it. A job to reproduce the error is very valuable :-) Best, Fabian Am Mo., 5. Nov. 2018 um 14:38 Uhr schrieb Flavio Pompermaier < pomperma...@okkam.it>: > Here it is the JIRA ticket and, attached to if, the Flink (Java) job to > reproduce th

Re: Java Table API and external catalog bug?

2018-10-25 Thread Fabian Hueske
IIRC, that was recently fixed. Might come out with 1.6.2 / 1.7.0. Cheers, Fabian Flavio Pompermaier schrieb am Do., 25. Okt. 2018, 14:09: > Ok thanks! I wasn't aware of this..that would be undoubtedly useful ;) > On Thu, Oct 25, 2018 at 2:00 PM Timo Walther wrote: > >> Hi Flavio, >> >> the ex

Re: Need help to understand memory consumption

2018-10-22 Thread Fabian Hueske
If yes, does that mean that I have to purge old state > backend in RocksDB ? > > Thanks a lot ! > > Regards, > Julien. > > - Mail original - > De: "Fabian Hueske" > À: "wangzhijiang999" > Cc: "Paul Lam" , jpreis...@fre

Re: Understand Broadcast State in Node Failure Case

2018-10-22 Thread Fabian Hueske
Hi Chengzhi, Broadcast State is checkpointed like any other state and will be restored in all failure cases (including the ones you mentioned). We added the warning to inform users that Broadcast state will also be stored in the JVM memory, even if the RocksDB StateBackend was configured (which st

Re: Checkpointing when one of the sources has completed

2018-10-17 Thread Fabian Hueske
Hi Niels, Checkpoints can only complete if all sources are running. That's because the checkpoint mechanism relies on injecting checkpoint barriers into the stream at the sources. Best, Fabian Am Mi., 17. Okt. 2018 um 11:11 Uhr schrieb Paul Lam : > Hi Niels, > > Please see https://issues.apache

Re: Need help to understand memory consumption

2018-10-17 Thread Fabian Hueske
Hi, As was said before, managed memory (as described in the blog post [1]) is only used for batch jobs. By default, managed memory is only lazily allocated, i.e., when a batch job is executed. Streaming jobs maintain state in state backends. Flink provides state backends that store the state on t

Re: When does Trigger.clear() get called?

2018-10-15 Thread Fabian Hueske
Hi, Re Q1: The main purpose of the Trigger.clean() method is to remove all custom state of the Trigger. State must be explicitly removed, otherwise the program leaks memory. Re Q3: If you are using a keyed stream, you need to manually clean up the state by calling State.clear(). If you are using a

Re: Not all files are processed? Stream source with ContinuousFileMonitoringFunction

2018-10-12 Thread Fabian Hueske
Hi, Which file system are you reading from? If you are reading from S3, this might be cause by S3's eventual consistency property. Have a look at FLINK-9940 [1] for a more detailed discussion. There is also an open PR [2], that you could try to patch the source operator with. Best, Fabian [1] ht

Re: When does Trigger.clear() get called?

2018-10-12 Thread Fabian Hueske
Hi Andrew, The PURGE action of a window removes the window state (i.e., the collected events or computed aggregate) but the window meta data including the Trigger remain. The Trigger.close() method is called, when the winodw is completely (i.e., all meta data) discarded. This happens, when the tim

Re: Identifying missing events in keyed streams

2018-10-11 Thread Fabian Hueske
I'd go with 2) because the logic is simple and it is (IMO) much easier to understand what is going on and what state is kept. Am Do., 11. Okt. 2018 um 12:42 Uhr schrieb Averell : > Hi Fabian, > > Thanks for the suggestion. > I will try with that support of removing timers. > > I have also tried a

Re: Partitions vs. Subpartitions

2018-10-11 Thread Fabian Hueske
Hi Chris, The terminology in the docs and code is not always consistent. It depends on the context. Both could also mean the same if they are used in different places. Can you point to the place(s) that refer to partition and subpartition? Fabian Am Do., 11. Okt. 2018 um 04:50 Uhr schrieb Kurt

Re: Identifying missing events in keyed streams

2018-10-10 Thread Fabian Hueske
Hi Averell, I'd go with approach 2). As of Flink 1.6.0 you can delete timers. But even if you are on a pre-1.6 version, a ProcessFunction would be the way to go, IMO. You don't need to register a timer for each event. Instead, you can register the first timer with the first event and have a state

Re: [deserialization schema] skip data, that couldn't be properly deserialized

2018-10-10 Thread Fabian Hueske
Hi Rinat, Thanks for discussing this idea. Yes, I think this would be a good feature. Can you open a Jira issue and describe the feature? Thanks, Fabian Am Do., 4. Okt. 2018 um 19:28 Uhr schrieb Rinat : > Hi mates, in accordance with the contract of > org.apache.flink.formats.avro.Deserializati

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

2018-10-10 Thread Fabian Hueske
Hi Xuefu, Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great! Can you go into details of what you are proposing? I can think of a couple ways to improve Flink in that regard: * Support for Hive UDFs * Support for Hive metadata cat

Re: getRuntimeContext(): The runtime context has not been initialized.

2018-10-10 Thread Fabian Hueske
Yes, it would be good to post your code. Are you using a FoldFunction in a window (if yes, what window) or as a running aggregate? In general, collecting state in a FoldFunction is usually not something that you should do. Did you consider using an AggregateFunction? Fabian Am Mi., 10. Okt. 2018

Re: [DISCUSS] Dropping flink-storm?

2018-10-09 Thread Fabian Hueske
Yes, let's do it this way. The wrapper classes are probably not too complex and can be easily tested. We have the same for the Hadoop interfaces, although I think only the Input- and OutputFormatWrappers are actually used. Am Di., 9. Okt. 2018 um 09:46 Uhr schrieb Chesnay Schepler < ches...@apach

Re: Duplicates in self join

2018-10-08 Thread Fabian Hueske
Did you check the new interval join that was added with Flink 1.6.0 [1]? It might be better suited because, each record has its own boundaries based on its timestamp and the join window interval. Best, Fabian [1] https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/stream/operators/joi

Re: Scala case class state evolution

2018-10-03 Thread Fabian Hueske
I know that Gordon (in CC) has looked closer into this problem. He should be able to share restrictions and maybe even workarounds. Best, Fabian Hequn Cheng schrieb am Mi., 3. Okt. 2018, 05:09: > Hi Elias, > > From my understanding, you can't do this since the state will no longer be > compatib

Re: Flink Python streaming

2018-10-03 Thread Fabian Hueske
Hi, AFAIK it's not that easy. Flink's Python support is based on Jython which translates Python code into JVM byte code. Therefore, native libs are not supported. Chesnay (in CC) knows the details here. Best, Fabian Hequn Cheng schrieb am Mi., 3. Okt. 2018, 04:30: > Hi Bing, > > I'm not famil

Re: Deserialization of serializer errored

2018-10-02 Thread Fabian Hueske
Hi Elias, I am not familiar with the recovery code, but Flink might read (some of ) the savepoint data even though it is not needed and loaded into operators. That would explain why you see an exception when the case class is modified or completely removed. Maybe Stefan or Gordon can help here.

Re: Does Flink SQL "in" operation has length limit?

2018-10-01 Thread Fabian Hueske
Hi, I had a look into the code. From what I saw, we are translating the values into Rows. The problem here is that the IN clause is translated into a join and that the join results contains a time attribute field. This is a safety restriction to ensure that time attributes do not lose their waterm

Re: About the retract of the calculation result of flink sql

2018-10-01 Thread Fabian Hueske
Hi Clay, If you do env.setParallelism(1), the query won't be executed in parallel. However, looking at your screenshot the message order does not seem to be the problem here (given that you printed the content of the topic). Are you sure that it is not possible that the result decreases if some r

Re: flink-1.6.1-bin-scala_2.11.tgz fails signture and hash verification

2018-10-01 Thread Fabian Hueske
Hi Gianluca, I tried to validate the issue but hash and signature are OK for me. Do you remember which mirror you used to download the binaries? Best, Fabian Am Sa., 29. Sep. 2018 um 17:24 Uhr schrieb vino yang : > Hi Gianluca, > > This is very strange, Till may be able to give an explanation

Re: I want run flink program in ubuntu x64 Mult Node Cluster what is configuration?

2018-10-01 Thread Fabian Hueske
Hi, these issues are not related to Flink but rather generic Linux / bash issues. Ensure that the start scripts are executable (can be changed with chmod) your user has the right permissions to executed the start scripts. Also, you have to use the right path to the scripts. If you are in the base

Re: Regarding implementation of aggregate function using a ProcessFunction

2018-10-01 Thread Fabian Hueske
Hi, There are basically three options: 1) Use an AggregateFunction and store everything that you would put into state into the Accumulator. This can become quite expensive because the Accumulator is de/serialized for every function call if you use RocksDB. The advantage is that you don't have to s

Re: In-Memory Lookup in Flink Operators

2018-10-01 Thread Fabian Hueske
Hi Chirag, Flink 1.5.0 added support for BroadcastState which should address your requirement of replicating the data. [1] The replicated data is stored in the configured state backend which can also be RocksDB. Regarding the reload, I would recommend Lasse's approach of having a custom source t

Re: Flink Scheduler Customization

2018-10-01 Thread Fabian Hueske
Hi Ananth, You can certainly do this with Flink, but there are no built-in operators for this. What you probably want to do is to compare the timestamp of the event with the current processing time and drop the record if it is too old. If the timestamp is encoded in the record, you can do this in

Re: [DISCUSS] Dropping flink-storm?

2018-10-01 Thread Fabian Hueske
+1 to drop it. Thanks, Fabian Am Sa., 29. Sep. 2018 um 12:05 Uhr schrieb Niels Basjes : > I would drop it. > > Niels Basjes > > On Sat, 29 Sep 2018, 10:38 Kostas Kloudas, > wrote: > > > +1 to drop it as nobody seems to be willing to maintain it and it also > > stands in the way for future deve

Re: Streaming to Parquet Files in HDFS

2018-10-01 Thread Fabian Hueske
Hi Bill, Flink 1.6.0 supports writing Avro records as Parquet files to HDFS via the previously mentioned StreamingFileSink [1], [2]. Best, Fabian [1] https://issues.apache.org/jira/browse/FLINK-9753 [2] https://issues.apache.org/jira/browse/FLINK-9750 Am Fr., 28. Sep. 2018 um 23:36 Uhr schrieb

Re: Can't get the FlinkKinesisProducer to work against Kinesalite for tests

2018-10-01 Thread Fabian Hueske
Hi Bruno, Thanks for sharing your approach! Best, Fabian Am Do., 27. Sep. 2018 um 18:11 Uhr schrieb Bruno Aranda : > Hi again, > > We managed at the end to get data into Kinesalite using the > FlinkKinesisProducer, but to do so, we had to use different configuration, > such as ignoring the 'aws

Re: Flink 1.5.4 -- issues w/ TaskManager connecting to ResourceManager

2018-09-26 Thread Fabian Hueske
Should we add a warning to the release announcements? Fabian Am Mi., 26. Sep. 2018 um 10:22 Uhr schrieb Robert Metzger < rmetz...@apache.org>: > Hey Jamie, > > we've been facing the same issue with dA Platform, when running Flink > 1.6.1. > I assume a lot of people will be affected by this. > >

Re: How to join stream and batch data in Flink?

2018-09-25 Thread Fabian Hueske
Hi, I don't think that using the current join implementation in the Table API / SQL will work. The non-windowed join fully materializes *both* input tables in state. This is necessary, because the join needs to be able to process updates on either side. While this is not a problem for the fixed si

Re: Get last element of a DataSe

2018-09-25 Thread Fabian Hueske
Hi, Can you post the full stacktrace? Thanks, Fabian Am Di., 25. Sep. 2018 um 12:55 Uhr schrieb Alejandro Alcalde < algu...@gmail.com>: > Hi, > > I am trying to improve the efficiency of this code: > > discretized.map(_._2) > .name("Map V") > .reduce((_, b) ⇒ b) >

Re: Potential bug in Flink SQL HOP and TUMBLE operators

2018-09-18 Thread Fabian Hueske
Hi John, Just to clarify, this missing data is due to the starting overhead and not due to a bug? Best, Fabian 2018-09-18 15:35 GMT+02:00 John Stone : > Thank you all for your assistance. I believe I've found the root cause if > the behavior I am seeing. > > If I just use "SELECT * FROM MyEven

Re: Add operator ids for an already running job

2018-09-18 Thread Fabian Hueske
, > Paul Lam > > > 在 2018年9月18日,20:09,Fabian Hueske 写道: > > The auto-generated ids are included in the savepoint data. So, it should > be possible to them from the savepoint. > However, AFAIK, there is no tool to do that. You'd need to manually dig > into the seriali

Re: Add operator ids for an already running job

2018-09-18 Thread Fabian Hueske
The auto-generated ids are included in the savepoint data. So, it should be possible to them from the savepoint. However, AFAIK, there is no tool to do that. You'd need to manually dig into the serialized data. Cheers, Fabian 2018-09-18 13:30 GMT+02:00 vino yang : > Hi Paul, > > Referring to the

Re: Migration to Flink 1.6.0, issues with StreamExecutionEnvironment.registerCachedFile

2018-09-18 Thread Fabian Hueske
Hi, The functionality of the SQL ScalarFunction is backed by Flink's distributed cache and just passes on the function call. I tried it locally on my machine and it works for me. What is your setup? Are you running on Yarn? Maybe Chesnay or Dawid (added to CC) can help to track the problem down.

Re: In which case the StreamNode has multiple output edges?

2018-09-18 Thread Fabian Hueske
Hi, Any operator can have multiple out-going edges. If you implement something like: DataStream instream = ... DataStream outstream1 = instream.map(new MapFunc1()); DataStream outstream2 = instream.map(new MapFunc2()); The node representing instream will have two outgoing edges that lead to the

Re: CombinableGroupReducer says The Iterable can be iterated over only once

2018-09-17 Thread Fabian Hueske
Hi Alejandro, asScala calls iterator() the first time and reduce() another time. These iterators can only be iterated once because they are possibly backed by multiple sorted files which have been spilled to disk and are merge-sorted while iterating. I'm actually surprised that you found this cod

Re: Potential bug in Flink SQL HOP and TUMBLE operators

2018-09-17 Thread Fabian Hueske
Hmm, that's interesting. HOP and TUMBLE window aggregations are directly translated into their corresponding DataStream counterparts (Sliding, Tumble). There should be no filtering of records. I assume you tried a simple query like "SELECT * FROM MyEventTable" and received all expected data? Fabi

Re: Expire records in FLINK dynamic tables?

2018-09-17 Thread Fabian Hueske
Hi Chen, Yes, this is possible. Have a look at the configuration of idle state retention time [1]. Best, Fabian [1] https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/table/streaming.html#idle-state-retention-time 2018-09-17 20:10 GMT+02:00 burgesschen : > Hi everyone, > I'm tryin

Re: Potential bug in Flink SQL HOP and TUMBLE operators

2018-09-17 Thread Fabian Hueske
Hi John, Are you sure that the first rows of the first window are dropped? When a query with processing time windows is terminated, the last window is not computed. This in fact intentional and does not apply to event-time windows. Best, Fabian 2018-09-17 17:21 GMT+02:00 John Stone : > Hello,

Re: Weird behaviour after change sources in a job.

2018-09-13 Thread Fabian Hueske
Hi, The problem is that Flink SQL does not expose the UIDs of the generated operators. We've met that issue before, but it is still not fully clear what would be the best way to this accessible. Best, Fabian 2018-09-13 5:15 GMT-04:00 Dawid Wysakowicz : > Hi Oleksandr, > > The mapping of state t

Re: How does flink read a DataSet?

2018-09-12 Thread Fabian Hueske
> is pushed from operators to operators in both stream and batch > > On Wed 12 Sep, 2018, 4:28 PM Fabian Hueske, wrote: > >> Actually, some parts of Flink's batch engine are similar to streaming as >> well. If the data does not need to be sorted or put into a hash

Re: How does flink read a DataSet?

2018-09-12 Thread Fabian Hueske
Actually, some parts of Flink's batch engine are similar to streaming as well. If the data does not need to be sorted or put into a hash-table, the data is pipelined (like in many relational database systems). For example, if you have a job that joins two inputs with a HashJoin, only the build side

Re: Missing Calcite SQL functions in table API

2018-09-05 Thread Fabian Hueske
Hi You are using SQL syntax in a Table API query. You have to stick to Table API syntax or use SQL as tEnv.sqlQuery("SELECT col1,CONCAT('field1:',col2,',field2:',CAST(col3 AS string)) FROM csvTable") The Flink documentation lists all supported functions for Table API [1] and SQL [2]. Best, Fabi

Re: Why don't operations on KeyedStream return KeyedStream?

2018-08-29 Thread Fabian Hueske
Hi Elias, Your assumption is correct. An operation on a KeyedStream results in a regular DataStream because the operation might change the data type or the key field. Hence, it is not guaranteed that the same keys can be extracted from the output of the keyed operation. However, there is a way to

Re: Queryable state and state TTL

2018-08-29 Thread Fabian Hueske
Hi, I guess that this is not a fundamental problem but just a limitation in the current implementation. Andrey (in CC) who implemented the TTL support should be able to give more insight on this issue. Best, Fabian Am Mi., 29. Aug. 2018 um 04:06 Uhr schrieb vino yang : > Hi Elias, > > From the

Re: Small-files source - partitioning based on prefix of file

2018-08-28 Thread Fabian Hueske
Hi, CMCF is not a source, only the file monitoring function is. Barriers are injected by the FMF when the JM sends a checkpoint message. The barriers then travel to the CMCF and trigger the Checkpoint ING. Fabian Averell schrieb am Di., 28. Aug. 2018, 12:02: > Hello Fabian, > > Thanks for t

Re: Semantic when table joins table from window

2018-08-28 Thread Fabian Hueske
GROUP BY >> article_id", the answer is "101,102,103" >> 2. if you change your sql to s"SELECT last_value(article_id) FROM >> praise", the answer is "100" >> >> Best, Hequn >> >> On Tue, Aug 21, 2018 at 8:52 PM, 徐涛 wrote:

Re: Small-files source - partitioning based on prefix of file

2018-08-28 Thread Fabian Hueske
Hi Averell, Barriers are injected into the regular data flow by source functions. In case of a file monitoring source, the barriers are injected into the stream of file splits that are passed to the ContinuousFileMonitoringFunction. The CFMF puts the splits into a queue and processes them with a d

Re: What's the advantage of using BroadcastState?

2018-08-28 Thread Fabian Hueske
B 56063, > Geschäftsführer: Bo PENG, Qiuen Peng, Shengli Wang > This e-mail and its attachments contain confidential information from > HUAWEI, which is intended only for the person or entity whose address is > listed above. Any use of the information contained herein in any way > (including

Re: would you join a Slack workspace for Flink?

2018-08-27 Thread Fabian Hueske
paste text instead of screenshots of text > 3. you keep formatting when pasting code in order to keep the code readable > 4. there are enough import statements to avoid ambiguities > > > > On Mon, Aug 27, 2018 at 10:51 AM Fabian Hueske wrote: > >> Hi, >> >> I

Re: Raising a bug in Flink's unit test scripts

2018-08-27 Thread Fabian Hueske
Hi Averell, If this is a more general error, I'd prefer a separate issue & PR. Thanks, Fabian Am Fr., 24. Aug. 2018 um 13:15 Uhr schrieb Averell : > Good day everyone, > > I'm writing unit test for the bug fix FLINK-9940, and found that in some > existing tests in flink-fs-tests cannot detect t

Re: [DISCUSS] Remove the slides under "Community & Project Info"

2018-08-27 Thread Fabian Hueske
I agree to remove the slides section. A lot of the content is out-dated and hence not only useless but might sometimes even cause confusion. Best, Fabian Am Mo., 27. Aug. 2018 um 08:29 Uhr schrieb Renjie Liu < liurenjie2...@gmail.com>: > Hi, Stephan: > Can we put project wiki in some place? I

Re: would you join a Slack workspace for Flink?

2018-08-27 Thread Fabian Hueske
Hi, I don't think that recommending Gists is a good idea. Sure, well formatted and highlighted code is nice and much better than posting screenshots but Gists can be deleted. Deleting a Gist would make an archived thread useless. I would definitely support instructions on how to add code to a mail

Re: lack of function and low usability of provided function

2018-08-23 Thread Fabian Hueske
Hi Henry, Flink is an open source project. New build-in functions are constantly contributed to Flink. Right now, there are more than 5 PRs open to add or improve various functions. If you find that some functions are not working correctly or could be improved, you can open a Jira issue. The same

Re: Semantic when table joins table from window

2018-08-21 Thread Fabian Hueske
Hi, The semantics of a query do not depend on the way that it is used. praiseAggr is a table that grows by one row per second and article_id. If you use that table in a join, the join will fully materialize the table. This is a special case because the same row is added multiple times, so the stat

Re: Will idle state retention trigger retract in dynamic table?

2018-08-21 Thread Fabian Hueske
rtain threshold. In that case, the query would only operate on tail of the stream, e.g., the last day or week. Best, Fabian 2018-08-21 12:03 GMT+02:00 徐涛 : > Hi Fabian, > Is the behavior a bit weird? Because it leads to data inconsistency. > > Best, > Henry > > > 在 2018年8月21日,下午5

Re: Will idle state retention trigger retract in dynamic table?

2018-08-21 Thread Fabian Hueske
the dynamic table. > > Best, > Henry > > > 在 2018年8月21日,下午4:16,Fabian Hueske 写道: > > Hi, > > No, it won't. I will simply remove state that has not been accessed for > the configured time but not change the result. > For example, if you have a GROUP BY agg

Re: UTF-16 support for TextInputFormat

2018-08-21 Thread Fabian Hueske
blocker or that I've identified the right component. > I'm afraid I don't have the bandwidth or knowledge to make the kind of > pull request you really need. I do hope my suggestions prove a little > useful. > > Thank you, > David > > On Fri, Aug 10, 2018 at

Re: Will idle state retention trigger retract in dynamic table?

2018-08-21 Thread Fabian Hueske
Hi, No, it won't. I will simply remove state that has not been accessed for the configured time but not change the result. For example, if you have a GROUP BY aggregation and the state for a grouping key is removed, the operator will start a new aggregation if a record with the removed grouping ke

Re: What's the advantage of using BroadcastState?

2018-08-20 Thread Fabian Hueske
Hi, I've recently published a blog post about Broadcast State [1]. Cheers, Fabian [1] https://data-artisans.com/blog/a-practical-guide-to-broadcast-state-in-apache-flink 2018-08-20 3:58 GMT+02:00 Paul Lam : > Hi Rong, Hequn > > Your answers are very helpful! Thank you! > > Best Regards, > Paul

Re: Limit on number of files to read for Dataset

2018-08-15 Thread Fabian Hueske
tionManager$1.get(PoolingHttpClientConnectionMan > ager.java:263) > at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke( > DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at com.amazon.ws.emr.hadoop.

Re: watermark does not progress

2018-08-15 Thread Fabian Hueske
Hi John, Watermarks cannot make progress if you have stream partitions that do not carry any data. What kind of source are you using? Best, Fabian 2018-08-15 4:25 GMT+02:00 vino yang : > Hi Johe, > > In local mode, it should also work. > When you debug, you can set a breakpoint in the getCurren

Re: how to assign issue to someone

2018-08-14 Thread Fabian Hueske
Hi, I've given you Contributor permissions for Jira and assigned the issue to you. You can now also assign other issue to you. Looking forward to your contribution. Best, Fabian 2018-08-14 19:45 GMT+02:00 Guibo Pan : > Hello, I am a new user for flink jira. I reported an issue and would like >

<    1   2   3   4   5   6   7   8   9   10   >