Re: StreamingFileSink Avro batch size and compression

2019-01-25 Thread Taher Koitawala
Can someone please help with this?

On Fri 25 Jan, 2019, 1:47 PM Taher Koitawala wrote:

> Hi All,
>  Is there a way to specify *batch size* and *compression* properties
> when using StreamingFileSink just like we did in the bucketing sink? The only
> parameters it accepts are the inactivity bucket check interval and the Avro
> schema.
>
>   We have numerous Flink jobs pulling data from the same Kafka
> topics but performing different operations, and each Flink job writes
> files of a different size. We would like to make this consistent.
>
>
> Regards,
> Taher Koitawala
> GS Lab Pune
> +91 8407979163
>


Re: Print table contents

2019-01-25 Thread Soheil Pourbafrani
Thanks a lot.

On Sat, Jan 26, 2019 at 10:22 AM Hequn Cheng  wrote:

> Hi Soheil,
>
> There is no print() or show() method in Table. As a workaround, you can
> convert[1] the Table into a DataSet and perform print() or collect() on the
> DataSet.
> You have to pay attention to the differences between DataSet.print() and
> DataSet.collect().
> DataSet.print() prints the elements of a DataSet to the standard output
> stream (System.out) of the JVM that calls the print() method. For programs
> that are executed on a cluster, this method needs to gather the contents of
> the DataSet back to the client in order to print it there.
> DataSet.collect() returns the elements of a DataSet as a List. As a DataSet
> can contain a lot of data, this method should be used with caution.
>
> Best,
> Hequn
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/dev/table/common.html#convert-a-table-into-a-dataset
>
>
> On Sat, Jan 26, 2019 at 3:24 AM Soheil Pourbafrani 
> wrote:
>
>> Hi,
>>
>> Using a Flink Table object, how can we print the table contents, something
>> like Spark's show() method?
>>
>> For example, in the following:
>>
>> tableEnv.registerDataSet("Orders", raw, "id, country, num, about");
>> Table results = tableEnv.sqlQuery("SELECT id FROM Orders WHERE id > 10");
>>
>> How can I print the contents of the results variable?
>>
>> Thanks
>>
>


Re: Print table contents

2019-01-25 Thread Hequn Cheng
Hi Soheil,

There is no print() or show() method in Table. As a workaround, you can
convert[1] the Table into a DataSet and perform print() or collect() on the
DataSet.
You have to pay attention to the differences between DataSet.print() and
DataSet.collect().
DataSet.print() prints the elements of a DataSet to the standard output
stream (System.out) of the JVM that calls the print() method. For programs
that are executed on a cluster, this method needs to gather the contents of
the DataSet back to the client in order to print it there.
DataSet.collect() returns the elements of a DataSet as a List. As a DataSet
can contain a lot of data, this method should be used with caution.

Best,
Hequn

[1]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/common.html#convert-a-table-into-a-dataset
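
For illustration, a minimal Java sketch of this workaround (the sample data and
field names are only illustrative, following the registerDataSet/sqlQuery
example quoted below):

```
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.BatchTableEnvironment;
import org.apache.flink.types.Row;

public class PrintTableExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        BatchTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);

        // Illustrative stand-in for the "raw" DataSet from the question below.
        DataSet<Tuple4<Integer, String, Integer, String>> raw = env.fromElements(
                Tuple4.of(11, "DE", 1, "a"),
                Tuple4.of(5, "US", 2, "b"));

        tableEnv.registerDataSet("Orders", raw, "id, country, num, about");
        Table results = tableEnv.sqlQuery("SELECT id FROM Orders WHERE id > 10");

        // Convert the Table back into a DataSet and print it on the client.
        // DataSet.print() triggers execution and gathers the result to the client.
        tableEnv.toDataSet(results, Row.class).print();
    }
}
```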


On Sat, Jan 26, 2019 at 3:24 AM Soheil Pourbafrani 
wrote:

> Hi,
>
> Using a Flink Table object, how can we print the table contents, something
> like Spark's show() method?
>
> For example, in the following:
>
> tableEnv.registerDataSet("Orders", raw, "id, country, num, about");
> Table results = tableEnv.sqlQuery("SELECT id FROM Orders WHERE id > 10");
>
> How can I print the contents of the results variable?
>
> Thanks
>


Re: [DISCUSS] Towards a leaner flink-dist

2019-01-25 Thread Hequn Cheng
Hi Chesnay,

Thanks a lot for the proposal! +1 for a leaner flink-dist and improving the
"Download" page.
I think a leaner flink-dist would be very helpful. If we bundle all jars
into a single one, it will easily cause class conflict problems.

Best,
Hequn


On Fri, Jan 25, 2019 at 2:48 PM jincheng sun 
wrote:

> Hi Chesnay,
>
> Thank you for the proposal. I like it very much.
>
> +1 for the leaner distribution.
>
> Regarding improving the "Download" page, I think we can add the connector
> download links in the "Optional components" section which @Timo Walther
> mentioned above.
>
>
> Regards,
> Jincheng
>
> On Fri, Jan 18, 2019 at 5:59 PM Chesnay Schepler wrote:
>
>> Hello,
>>
>> The binary distribution that we release currently contains quite a lot of
>> optional components, including various filesystems, metric reporters and
>> libraries. Most users will only use a fraction of these, so the rest
>> pretty much only increases the size of flink-dist.
>>
>> With Flink growing more and more in scope I don't believe it to be
>> feasible to ship everything we have with every distribution, and instead
>> suggest more of a "pick-what-you-need" model, where flink-dist is rather
>> lean and additional components are downloaded separately and added by
>> the user.
>>
>> This would primarily affect the /opt directory, but could also be
>> extended to cover flink-dist. For example, the yarn and mesos code could
>> be spliced out into separate jars that could be added to lib manually.
>>
>> Let me know what you think.
>>
>> Regards,
>>
>> Chesnay
>>
>>


Re: Query on retract stream

2019-01-25 Thread Hequn Cheng
Hi Gagan,

Time attribute fields will be materialized by the unbounded group by. Also,
currently, the window doesn't have the ability to handle retraction
messages. I see two ways to solve the problem.

- Use multi-window. The first window performs lastValue, the second
performs count.
- Use two non-window aggregates. In this case, you don't have to change
anything for the first aggregate. For the second one, you can group by an
hour field and perform count(). The code looks like:

SELECT userId,
 count(orderId)
FROM
(SELECT orderId,
 getHour(orderTime) as myHour,
 lastValue(userId) AS userId,
 lastValue(status) AS status
FROM orders
GROUP BY  orderId, orderTime)
WHERE status='PENDING'
GROUP BY myHour, userId
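
A minimal sketch of the getHour scalar UDF referenced above (hypothetical name
and semantics; it assumes orderTime arrives as a SQL TIMESTAMP):

```
import java.sql.Timestamp;
import org.apache.flink.table.functions.ScalarFunction;

// Hypothetical scalar UDF backing getHour(orderTime) in the query above.
// It maps a timestamp to an "hour bucket" (epoch hour here; adjust as needed).
public class GetHour extends ScalarFunction {
    public long eval(Timestamp orderTime) {
        return orderTime.getTime() / (60 * 60 * 1000);
    }
}

// Registration (the name is illustrative):
// tableEnv.registerFunction("getHour", new GetHour());
```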

Best,
Hequn




On Sat, Jan 26, 2019 at 12:29 PM Gagan Agrawal 
wrote:

> Based on the suggestions in this mail thread, I tried out a few experiments
> on an upsert stream with Flink 1.7.1, and here is the issue I am facing with
> the windowed stream.
>
> *1. Global pending order count*
> The following query works fine and is able to handle updates as per the
> original requirement.
>
> select userId, count(orderId) from
> (select orderId, lastValue(userId) as userId, lastValue(status) as status
> from orders group by orderId)
> where status='PENDING' group by userId
>
> *2. Last 1 hour tumbling window count (append stream)*
> Though the following query doesn't handle the upsert stream, I just tried it
> to make sure the time column is working fine. This works, but as expected, it
> doesn't handle updates on orderId.
>
> select userId, count(orderId) from orders
> where status='PENDING' group by TUMBLE(orderTime, INTERVAL '1' HOUR),
> userId
>
> *3. Last 1 hour tumbling window count (with upsert stream)*
> Now I tried a combination of the above two, where the input stream is
> converted to an upsert stream (via the lastValue aggregate function) and the
> pending count then needs to be calculated over the last 1 hour window.
>
> select userId, count(orderId) from
> (select orderId, orderTime, lastValue(userId) as userId, lastValue(status)
> as status from orders group by orderId, orderTime)
> where status='PENDING' group by TUMBLE(orderTime, INTERVAL '1' HOUR),
> userId
>
> This one gives me the following error. Is this because I have added orderTime
> to the group by/select clause and hence its time characteristic has changed?
> What is the workaround here, since without adding orderTime I cannot perform
> a window aggregation on the upsert stream?
>
> [error] Exception in thread "main"
> org.apache.flink.table.api.ValidationException:* Window can only be
> defined over a time attribute column.*
> [error] at
> org.apache.flink.table.plan.rules.datastream.DataStreamLogicalWindowAggregateRule.getOperandAsTimeIndicator$1(DataStreamLogicalWindowAggregateRule.scala:84)
> [error] at
> org.apache.flink.table.plan.rules.datastream.DataStreamLogicalWindowAggregateRule.translateWindowExpression(DataStreamLogicalWindowAggregateRule.scala:89)
> [error] at
> org.apache.flink.table.plan.rules.common.LogicalWindowAggregateRule.onMatch(LogicalWindowAggregateRule.scala:65)
> [error] at
> org.apache.calcite.plan.AbstractRelOptPlanner.fireRule(AbstractRelOptPlanner.java:315)
> [error] at
> org.apache.calcite.plan.hep.HepPlanner.applyRule(HepPlanner.java:556)
> [error] at
> org.apache.calcite.plan.hep.HepPlanner.applyRules(HepPlanner.java:415)
> [error] at
> org.apache.calcite.plan.hep.HepPlanner.executeInstruction(HepPlanner.java:252)
> [error] at
> org.apache.calcite.plan.hep.HepInstruction$RuleInstance.execute(HepInstruction.java:127)
> [error] at
> org.apache.calcite.plan.hep.HepPlanner.executeProgram(HepPlanner.java:211)
> [error] at
> org.apache.calcite.plan.hep.HepPlanner.findBestExp(HepPlanner.java:198)
> [error] at
> org.apache.flink.table.api.TableEnvironment.runHepPlanner(TableEnvironment.scala:360)
> [error] at
> org.apache.flink.table.api.TableEnvironment.runHepPlannerSequentially(TableEnvironment.scala:326)
> [error] at
> org.apache.flink.table.api.TableEnvironment.optimizeNormalizeLogicalPlan(TableEnvironment.scala:282)
> [error] at
> org.apache.flink.table.api.StreamTableEnvironment.optimize(StreamTableEnvironment.scala:811)
> [error] at
> org.apache.flink.table.api.StreamTableEnvironment.translate(StreamTableEnvironment.scala:860)
> [error] at
> org.apache.flink.table.api.scala.StreamTableEnvironment.toRetractStream(StreamTableEnvironment.scala:205)
> [error] at
> org.apache.flink.table.api.scala.StreamTableEnvironment.toRetractStream(StreamTableEnvironment.scala:185)
>
> Gagan
>
> On Tue, Jan 22, 2019 at 7:01 PM Gagan Agrawal 
> wrote:
>
>> Thanks Hequn for your response. I initially thought of trying out the "over
>> window" clause, however as per the documentation there seems to be a
>> limitation in the "orderBy" clause where it allows only a single
>> event/processing time attribute. Whereas in my 

Re: TimeZone shift problem in Flink SQL

2019-01-25 Thread Rong Rong
Hi Henry,

Unix epoch time values are always under GMT timezone, for example:
- 1548162182001 <=> GMT: Tuesday, January 22, 2019 1:03:02.001 PM, or CST:
Tuesday, January 22, 2019 9:03:02.001 PM.
- 1548190982001 <=> GMT: Tuesday, January 22, 2019 9:03:02.001 PM, or CST:
Wednesday, January 23, 2019 4:03:02.001 AM.

Several things are needed here:
1. your "unix_timestamp" UDF should return actual Unix epoch time [1];
2. as Bowen mentioned, you will have to pass in the desired timezone as an
argument to your "from_unixtime" UDF.

--
Rong

[1]: https://en.wikipedia.org/wiki/Unix_time
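
For illustration, a minimal sketch of point 2 (hypothetical class name; it
assumes the UDF receives epoch milliseconds, as in your example):

```
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;
import org.apache.flink.table.functions.ScalarFunction;

// Hypothetical timezone-aware variant of the from_unixtime UDF:
// formats an epoch-millisecond value in an explicitly given timezone.
public class FromUnixtime extends ScalarFunction {
    public String eval(long epochMillis, String format, String timeZone) {
        SimpleDateFormat formatter = new SimpleDateFormat(format);
        formatter.setTimeZone(TimeZone.getTimeZone(timeZone));
        return formatter.format(new Date(epochMillis));
    }
}

// e.g. from_unixtime(unix_timestamp(LAST_UPDATE_TIME, '...'), 'yyyy-MM-dd', 'Asia/Shanghai')
```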

On Thu, Jan 24, 2019 at 4:43 PM Bowen Li  wrote:

> Hi,
>
> Did you consider the timezone in the conversion in your UDF?
>
>
> On Tue, Jan 22, 2019 at 5:29 AM 徐涛  wrote:
>
>> Hi Experts,
>> I have the following two UDFs:
>> unix_timestamp:   transforms from String to Timestamp, with the
>> arguments (value: String, format: String), returns Timestamp
>> from_unixtime:    transforms from Timestamp to String, with the
>> arguments (ts: Long, format: String), returns String
>>
>>
>> select
>>  number,
>>  ts,
>>  from_unixtime(unix_timestamp(LAST_UPDATE_TIME, 'EEE MMM dd
>> HH:mm:Ss z '),'-MM-dd')  as dt
>>   from
>>  test;
>>
>>  When the LAST_UPDATE_TIME value is "Tue Jan 22 21:03:12 CST 2019",
>> unix_timestamp returns a Timestamp with value 1548162182001,
>>   but when from_unixtime is invoked, a timestamp with
>> value 1548190982001 is passed in; there is an 8-hour shift between them.
>>   May I know why there is an 8-hour shift, and how I can
>> get the timestamp originally returned by the first UDF without
>> changing the code?
>>   Thanks very much.
>>
>> Best
>> Henry
>>
>


Re: Query on retract stream

2019-01-25 Thread Gagan Agrawal
Based on the suggestions in this mail thread, I tried out a few experiments
on an upsert stream with Flink 1.7.1, and here is the issue I am facing with
the windowed stream.

*1. Global pending order count*
The following query works fine and is able to handle updates as per the
original requirement.

select userId, count(orderId) from
(select orderId, lastValue(userId) as userId, lastValue(status) as status
from orders group by orderId)
where status='PENDING' group by userId

*2. Last 1 hour tumbling window count (append stream)*
Though the following query doesn't handle the upsert stream, I just tried it
to make sure the time column is working fine. This works, but as expected, it
doesn't handle updates on orderId.

select userId, count(orderId) from orders
where status='PENDING' group by TUMBLE(orderTime, INTERVAL '1' HOUR), userId

*3. Last 1 hour tumbling window count (with upsert stream)*
Now I tried a combination of the above two, where the input stream is converted
to an upsert stream (via the lastValue aggregate function) and the pending count
then needs to be calculated over the last 1 hour window.

select userId, count(orderId) from
(select orderId, orderTime, lastValue(userId) as userId, lastValue(status)
as status from orders group by orderId, orderTime)
where status='PENDING' group by TUMBLE(orderTime, INTERVAL '1' HOUR), userId

This one gives me the following error. Is this because I have added orderTime
to the group by/select clause and hence its time characteristic has changed?
What is the workaround here, since without adding orderTime I cannot perform
a window aggregation on the upsert stream?

[error] Exception in thread "main"
org.apache.flink.table.api.ValidationException:* Window can only be defined
over a time attribute column.*
[error] at
org.apache.flink.table.plan.rules.datastream.DataStreamLogicalWindowAggregateRule.getOperandAsTimeIndicator$1(DataStreamLogicalWindowAggregateRule.scala:84)
[error] at
org.apache.flink.table.plan.rules.datastream.DataStreamLogicalWindowAggregateRule.translateWindowExpression(DataStreamLogicalWindowAggregateRule.scala:89)
[error] at
org.apache.flink.table.plan.rules.common.LogicalWindowAggregateRule.onMatch(LogicalWindowAggregateRule.scala:65)
[error] at
org.apache.calcite.plan.AbstractRelOptPlanner.fireRule(AbstractRelOptPlanner.java:315)
[error] at
org.apache.calcite.plan.hep.HepPlanner.applyRule(HepPlanner.java:556)
[error] at
org.apache.calcite.plan.hep.HepPlanner.applyRules(HepPlanner.java:415)
[error] at
org.apache.calcite.plan.hep.HepPlanner.executeInstruction(HepPlanner.java:252)
[error] at
org.apache.calcite.plan.hep.HepInstruction$RuleInstance.execute(HepInstruction.java:127)
[error] at
org.apache.calcite.plan.hep.HepPlanner.executeProgram(HepPlanner.java:211)
[error] at
org.apache.calcite.plan.hep.HepPlanner.findBestExp(HepPlanner.java:198)
[error] at
org.apache.flink.table.api.TableEnvironment.runHepPlanner(TableEnvironment.scala:360)
[error] at
org.apache.flink.table.api.TableEnvironment.runHepPlannerSequentially(TableEnvironment.scala:326)
[error] at
org.apache.flink.table.api.TableEnvironment.optimizeNormalizeLogicalPlan(TableEnvironment.scala:282)
[error] at
org.apache.flink.table.api.StreamTableEnvironment.optimize(StreamTableEnvironment.scala:811)
[error] at
org.apache.flink.table.api.StreamTableEnvironment.translate(StreamTableEnvironment.scala:860)
[error] at
org.apache.flink.table.api.scala.StreamTableEnvironment.toRetractStream(StreamTableEnvironment.scala:205)
[error] at
org.apache.flink.table.api.scala.StreamTableEnvironment.toRetractStream(StreamTableEnvironment.scala:185)

Gagan

On Tue, Jan 22, 2019 at 7:01 PM Gagan Agrawal 
wrote:

> Thanks Hequn for your response. I initially thought of trying out the "over
> window" clause, however as per the documentation there seems to be a
> limitation in the "orderBy" clause where it allows only a single
> event/processing time attribute. Whereas in my case events are generated
> from the MySQL binlog, where I have seen multiple event updates generated
> with the same timestamp (maybe because they are part of the same
> transaction), and hence I will need the binlog offset along with the
> timestamp to be able to sort them correctly. So it looks like I can't use
> "over window" until it allows multiple columns in "orderBy". I am exploring
> the option of creating my own window, as you suggested, to be more flexible.
>
> Gagan
>
> On Tue, Jan 22, 2019 at 7:23 AM Hequn Cheng  wrote:
>
>> Hi Gagan,
>>
>> > But I also have a requirement for event time based sliding window
>> aggregation
>>
>> Yes, you can achieve this with the Flink Table API/SQL. However, currently,
>> sliding windows don't support early firing, i.e., they only output results
>> when event time reaches the end of the window. Once the window fires, the
>> window state will be cleared and late data belonging to this window will be
>> ignored. In order to wait for the late event, you can extract

Re: How test and validate a data stream software?

2019-01-25 Thread Congxian Qiu
Hi, Alexandre

Maybe the blog post[1] can be helpful.

[1] https://www.da-platform.com/blog/extending-the-yahoo-streaming-benchmark

On Wed, Jan 23, 2019 at 9:54 PM Alexandre Strapacao Guedes Vianna wrote:

> Hello People,
>
> I'm conducting a study for my PhD about applications using data stream
> processing, and I would like to investigate the following questions:
>
>
>- How do you test and validate data stream software?
>- Are there specific testing frameworks, tools, or testing environments?
>- What are the strategies for generating test data?
>
>
> Please, could you help me by explaining how you test and validate your
> stream-based/oriented software?
>
> Regards Alexandre
>


-- 
Best,
Congxian


Print table contents

2019-01-25 Thread Soheil Pourbafrani
Hi,

Using a Flink Table object, how can we print the table contents, something
like Spark's show() method?

For example, in the following:

tableEnv.registerDataSet("Orders", raw, "id, country, num, about");
Table results = tableEnv.sqlQuery("SELECT id FROM Orders WHERE id > 10");

How can I print the contents of the results variable?

Thanks


AssertionError: mismatched type $5 TIMESTAMP(3)

2019-01-25 Thread Chris Miller
I'm trying to group some data and then enrich it by joining with a
temporal table function; however, my test code (attached) is failing with
the error shown below. Can someone please give me a clue as to what I'm
doing wrong?


Exception in thread "main" java.lang.AssertionError: mismatched type $5 
TIMESTAMP(3)
at 
org.apache.calcite.rex.RexUtil$FixNullabilityShuttle.visitInputRef(RexUtil.java:2481)
at 
org.apache.calcite.rex.RexUtil$FixNullabilityShuttle.visitInputRef(RexUtil.java:2459)

at org.apache.calcite.rex.RexInputRef.accept(RexInputRef.java:112)
at org.apache.calcite.rex.RexShuttle.visitList(RexShuttle.java:151)
at org.apache.calcite.rex.RexShuttle.visitCall(RexShuttle.java:100)
at org.apache.calcite.rex.RexShuttle.visitCall(RexShuttle.java:34)
at org.apache.calcite.rex.RexCall.accept(RexCall.java:107)
at org.apache.calcite.rex.RexShuttle.apply(RexShuttle.java:279)
at org.apache.calcite.rex.RexShuttle.mutate(RexShuttle.java:241)
at org.apache.calcite.rex.RexShuttle.apply(RexShuttle.java:259)
at org.apache.calcite.rex.RexUtil.fixUp(RexUtil.java:1605)
at 
org.apache.calcite.rel.rules.FilterJoinRule.perform(FilterJoinRule.java:230)
at 
org.apache.calcite.rel.rules.FilterJoinRule$FilterIntoJoinRule.onMatch(FilterJoinRule.java:344)
at 
org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:212)
at 
org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:646)
at 
org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:339)
at 
org.apache.flink.table.api.TableEnvironment.runVolcanoPlanner(TableEnvironment.scala:374)
at 
org.apache.flink.table.api.TableEnvironment.optimizeLogicalPlan(TableEnvironment.scala:292)
at 
org.apache.flink.table.api.StreamTableEnvironment.optimize(StreamTableEnvironment.scala:812)
at 
org.apache.flink.table.api.StreamTableEnvironment.translate(StreamTableEnvironment.scala:860)
at 
org.apache.flink.table.api.java.StreamTableEnvironment.toRetractStream(StreamTableEnvironment.scala:340)
at 
org.apache.flink.table.api.java.StreamTableEnvironment.toRetractStream(StreamTableEnvironment.scala:272)

at test.Test.main(Test.java:78)

(Attachment: Test.java)


Re: `env.java.opts` not persisting after job canceled or failed and then restarted

2019-01-25 Thread Aaron Levin
I don't control the code calling `System.loadLibrary("hadoop")` so that's
not an option for me, unfortunately.

On Thu, Jan 24, 2019 at 7:47 PM Guowei Ma  wrote:

> This may be caused by the fact that a JVM process can only load a given .so
> once. So a tricky way around this is to rename it.
>
> Sent from my iPhone
>
> On Jan 25, 2019, at 7:12 AM, Aaron Levin wrote:
>
> Hi Ufuk,
>
> Update: I've pinned down the issue. It's multiple classloaders loading
> `libhadoop.so`:
>
> ```
> failed to load native hadoop with error: java.lang.UnsatisfiedLinkError:
> Native Library /usr/lib/libhadoop.so already loaded in another classloader
> ```
>
> I'm not quite sure what the solution is. Ideally flink would destroy a
> classloader when a job is canceled, but perhaps there's a jvm limitation
> there? Putting the libraries into `/usr/lib` or `/lib` does not work (as
> suggested by Chesnay in the ticket) as I get the same error. I might see if
> I can put a jar with `org.apache.hadoop.common.io.compress` in
> `/flink/install/lib` and then remove it from my jar. It's not an ideal
> solution but I can't think of anything else.
>
> Best,
>
> Aaron Levin
>
> On Thu, Jan 24, 2019 at 12:18 PM Aaron Levin 
> wrote:
>
>> Hi Ufuk,
>>
>> I'm starting to believe the bug is much deeper than the originally
>> reported error because putting the libraries in `/usr/lib` or `/lib` does
>> not work. This morning I dug into why putting `libhadoop.so` into
>> `/usr/lib` didn't work, despite that being in the `java.library.path` at
>> the call site of the error. I wrote a small program to test the loading of
>> native libraries, and it was able to successfully load `libhadoop.so`. I'm
>> very perplexed. Could this be related to the way flink shades hadoop stuff?
>>
>> Here is my program and its output:
>>
>> ```
>> $ cat LibTest.scala
>> package com.redacted.flink
>>
>> object LibTest {
>>   def main(args: Array[String]): Unit = {
>>     val library = args(0)
>>
>>     System.out.println(s"java.library.path=${System.getProperty("java.library.path")}")
>>     System.out.println(s"Attempting to load $library")
>>     System.out.flush()
>>     System.loadLibrary(library)
>>     System.out.println(s"Successfully loaded ")
>>     System.out.flush()
>>   }
>> }
>> ```
>>
>> I then tried running that on one of the task managers with `hadoop` as an
>> argument:
>>
>> ```
>> $ java -jar lib_test_deploy.jar hadoop
>>
>> java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
>> Attempting to load hadoop
>> Exception in thread "main" java.lang.UnsatisfiedLinkError: no hadoop in
>> java.library.path
>> at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
>> at java.lang.Runtime.loadLibrary0(Runtime.java:870)
>> at java.lang.System.loadLibrary(System.java:1122)
>> at com.stripe.flink.LibTest$.main(LibTest.scala:11)
>> at com.stripe.flink.LibTest.main(LibTest.scala)
>> ```
>>
>> I then copied the native libraries into `/usr/lib/` and ran it again:
>>
>> ```
>> $ sudo cp /usr/local/hadoop/lib/native/libhadoop.so /usr/lib/
>> $ java -jar lib_test_deploy.jar hadoop
>>
>> java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
>> Attempting to load hadoop
>> Successfully loaded
>> ```
>>
>> Any ideas?
>>
>> On Wed, Jan 23, 2019 at 7:13 PM Aaron Levin 
>> wrote:
>>
>>> Hi Ufuk,
>>>
>>> One more update: I tried copying all the hadoop native `.so` files
>>> (mainly `libhadoop.so`) into `/lib` and I am still experiencing the issue I
>>> reported. I also tried naively adding the `.so` files to the jar with the
>>> flink application and am still experiencing the issue I reported (however,
>>> I'm going to investigate this further as I might not have done it
>>> correctly).
>>>
>>> Best,
>>>
>>> Aaron Levin
>>>
>>> On Wed, Jan 23, 2019 at 3:18 PM Aaron Levin 
>>> wrote:
>>>
 Hi Ufuk,

 Two updates:

 1. As suggested in the ticket, I naively copied every `.so` in
 `hadoop-3.0.0/lib/native/` into `/lib/` and this did not seem to help. My
 knowledge of how shared libs get picked up is hazy, so I'm not sure if
 blindly copying them like that should work. I did check what
 `System.getProperty("java.library.path")` returns at the call-site and
 it's: 
 java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
 2. The exception I see comes from
 `hadoop.util.NativeCodeLoader.buildSupportsSnappy` (stack-trace below).
 This uses `System.loadLibrary("hadoop")`.

 [2019-01-23 19:52:33.081216] java.lang.UnsatisfiedLinkError:
 org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
 [2019-01-23 19:52:33.081376]  at
 org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native Method)
 [2019-01-23 19:52:33.081406]  at
 org.apache.hadoop.io.compress.Snapp

RE: [Flink 1.6] How to get current total number of processed events

2019-01-25 Thread Thanh-Nhan Vo
Hi Kien,

Thank you so much for your answer!
Regards,
Nhan

From: Kien Truong [mailto:duckientru...@gmail.com]
Sent: Friday, January 25, 2019 1:47 PM
To: Thanh-Nhan Vo ; user@flink.apache.org
Subject: Re: [Flink 1.6] How to get current total number of processed events


Hi Nhan,

To get a global view over all events, you can use a non-keyed TumblingWindow 
and a ProcessAllWindowFunction.

Inside the ProcessAllWindowFunction, you calculate the min/max/count of the 
elements of that window,

compare them to the existing values in the global state, then update the
global state with the new min/max/count to use in the next window.

You can also get the min/max/count downstream by emitting it together with the 
window's item.



Do note that non-keyed Window always run with a parallelism of 1, so it might 
create a hotspot/bottleneck in your stream.



Regards,

Kien


On 1/25/2019 3:17 PM, Thanh-Nhan Vo wrote:
Hi Kien,

Thank you for your answer.

Please correct me if I'm wrong. If I understand well, if I store the max/min
value using the value states of a KeyedProcessFunction, this max/min value is
calculated per key?

Note that in my case, I expect that at every instant I can obtain the
maximum/minimum number of processed messages over all keys. For example:


- Input datastream: [message1(k1, v1), message2(k2, v2), message3(k1, v3),
  message4(k4, v4), message5(k1, v5), message6(k2, v6), message7(k7, v7)]


- When processing message7(k7, v7), I expect to obtain:

  o Maximum number of processed messages: 3 (corresponding to key k1)

  o Minimum number of processed messages: 1 (corresponding to keys k4 and k7)

Do you have any idea to obtain this, please?

Thank you so much !

Nhan
From: Kien Truong [mailto:duckientru...@gmail.com]
Sent: Thursday, January 24, 2019 12:45 PM
To: Thanh-Nhan Vo ; user@flink.apache.org
Subject: Re: [Flink 1.6] How to get current total number of processed events


Hi Nhan,

You can store the max/min value using the value states of a 
KeyedProcessFunction,

or in the global state of a ProcessWindowFunction.



On processing each item, compare its value to the current max/min and update 
the stored value as needed.



Regards,

Kien


On 1/24/2019 12:37 AM, Thanh-Nhan Vo wrote:
Hi Kien Truong,

Thank you for your answer. I have another question, please !
If I count the number of messages processed for a given key j (denoted c_j), is 
there a way to retrieve max{c_j}, min{c_j}?

Thanks

From: Kien Truong [mailto:duckientru...@gmail.com]
Sent: Wednesday, January 23, 2019 4:04 PM
To: user@flink.apache.org
Subject: Re: [Flink 1.6] How to get current total number of processed events


Hi Nhan,

Logically, the total number of processed events before an event cannot be 
accurately calculated unless event processing is synchronized.

This is not scalable, so naturally I don't think Flink supports it.

Although, I suppose you can get an approximate count by using a non-keyed 
TumblingWindow, count the item inside the window, then use that value in the 
next window.



Regards,

Kien


On 1/21/2019 9:34 PM, Thanh-Nhan Vo wrote:
Hello all,

I have a question, please !
I'm using Flink 1.6 to process our data in streaming mode.
I wonder if at a given event, there is a way to get the current total number of 
processed events (before this event).
If possible, I want to get this total number of processed events as a value 
state in Keystream.
It means that for a given key in KeyStream, I want to retrieve not only the 
total number of processed events for this key but also the total number of 
processed events for all keys.

Is there a way to do this in Flink 1.6, please?
Best regard,
Nhan



Re: [Flink 1.6] How to get current total number of processed events

2019-01-25 Thread Kien Truong

Hi Nhan,

To get a global view over all events, you can use a non-keyed 
TumblingWindow and a ProcessAllWindowFunction.


Inside the ProcessAllWindowFunction, you calculate the min/max/count of 
the elements of that window,


compare them to the existing values in the global state, then update
the global state with the new min/max/count to use in the next window.


You can also get the min/max/count downstream by emitting it together 
with the window's item.



Do note that non-keyed Window always run with a parallelism of 1, so it 
might create a hotspot/bottleneck in your stream.
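
A minimal sketch of this pattern (hypothetical class and state names; it
assumes a non-keyed DataStream<Long> and the Flink 1.6-style
ProcessAllWindowFunction API):

```
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Hypothetical sketch: merge each window's min/max/count into global state
// and emit the updated global (min, max, count) downstream.
public class GlobalMinMaxCount
        extends ProcessAllWindowFunction<Long, Tuple3<Long, Long, Long>, TimeWindow> {

    @Override
    public void process(Context ctx, Iterable<Long> elements,
                        Collector<Tuple3<Long, Long, Long>> out) throws Exception {
        ValueState<Tuple3<Long, Long, Long>> stats = ctx.globalState().getState(
                new ValueStateDescriptor<>("minMaxCount",
                        TypeInformation.of(new TypeHint<Tuple3<Long, Long, Long>>() {})));

        Tuple3<Long, Long, Long> current = stats.value();
        long min = current == null ? Long.MAX_VALUE : current.f0;
        long max = current == null ? Long.MIN_VALUE : current.f1;
        long count = current == null ? 0L : current.f2;

        for (Long value : elements) {
            min = Math.min(min, value);
            max = Math.max(max, value);
            count++;
        }

        Tuple3<Long, Long, Long> updated = Tuple3.of(min, max, count);
        stats.update(updated);
        out.collect(updated);
    }
}

// Usage (non-keyed, so effectively parallelism 1):
// stream.windowAll(TumblingProcessingTimeWindows.of(Time.minutes(1)))
//       .process(new GlobalMinMaxCount());
```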



Regards,

Kien


On 1/25/2019 3:17 PM, Thanh-Nhan Vo wrote:


Hi Kien,

Thank you for your answer.

Please correct me if I'm wrong. If I understand well, if I store the
max/min value using the value states of a KeyedProcessFunction, this
max/min value is calculated per key?


Note that in my case, I expect that at every instant I can obtain
the maximum/minimum number of processed messages over all keys. For
example:


- Input datastream: [message1(k1, v1), message2(k2, v2), message3(k1, v3),
  message4(k4, v4), message5(k1, v5), message6(k2, v6), message7(k7, v7)]


- When processing message7(k7, v7), I expect to obtain:

  o Maximum number of processed messages: 3 (corresponding to key k1)

  o Minimum number of processed messages: 1 (corresponding to keys k4 and k7)

Do you have any idea to obtain this, please?

Thank you so much !

Nhan

*From:* Kien Truong [mailto:duckientru...@gmail.com]
*Sent:* Thursday, January 24, 2019 12:45 PM
*To:* Thanh-Nhan Vo ; user@flink.apache.org
*Subject:* Re: [Flink 1.6] How to get current total number of processed
events


Hi Nhan,

You can store the max/min value using the value states of a 
KeyedProcessFunction,


or in the global state of a ProcessWindowFunction.

On processing each item, compare its value to the current max/min and 
update the stored value as needed.


Regards,

Kien

On 1/24/2019 12:37 AM, Thanh-Nhan Vo wrote:

Hi Kien Truong,


Thank you for your answer. I have another question, please !
If I count the number of messages processed for a given key j
(denoted c_j), is there a way to retrieve max{c_j}, min{c_j}?

Thanks

*From:* Kien Truong [mailto:duckientru...@gmail.com]
*Sent:* Wednesday, January 23, 2019 4:04 PM
*To:* user@flink.apache.org
*Subject:* Re: [Flink 1.6] How to get current total number of
processed events

Hi Nhan,

Logically, the total number of processed events before an event
cannot be accurately calculated unless event processing is
synchronized.

This is not scalable, so naturally I don't think Flink supports it.

Although, I suppose you can get an approximate count by using a
non-keyed TumblingWindow, count the item inside the window, then
use that value in the next window.

Regards,

Kien

On 1/21/2019 9:34 PM, Thanh-Nhan Vo wrote:

Hello all,

I have a question, please !
I’m using Flink 1.6 to process our data in streaming mode.
I wonder if at a given event, there is a way to get the
current total number of processed events (before this event).

If possible, I want to get this total number of processed
events as a value state in Keystream.
It means that for a given key in KeyStream, I want to retrieve
not only the total number of processed events for this key but
also the total number of processed events for all keys.

Is there a way to do this in Flink 1.6, please?

Best regard,
Nhan



Re: No resource available error while testing HA

2019-01-25 Thread Averell
Hi Gary,

Yes, my problem mentioned in the original post had been resolved by
correcting the zookeeper connection string.

I have two other relevant questions, if you have time, please help:

1. Regarding JM high availability, when I shut down the host on which the JM
was running, YARN would detect the missing JM and start a new one after 10
minutes, and the Flink job would be restored. However, on the console screen
from which I submitted the job, I got the following error message: "/The
program finished with the following exception:
org.apache.flink.client.program.ProgramInvocationException/" (full stack
trace in the attached file flink_console_timeout.log).
Is there any way to avoid this? If I run this as an AWS EMR job, the job
would be considered failed, while it is actually restored automatically
by YARN after 10 minutes.

2. Regarding logging, could you please help explain the source of the
error messages shown in the "Exception" tab of the Flink job GUI (as per the
screenshot below)? I could not find any log file that has that message (not in
jobmanager.log or in taskmanager.log in EMR's hadoop-yarn logs folder).

[screenshot]

Thanks and regards,
Averell



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Is there a way to get all flink build-in SQL functions

2019-01-25 Thread Timo Walther
The problem right now is that Flink SQL has two stacks for defining 
functions. One is the built-in function stack that is based on Calcite 
and the other is the registered UDFs.


What you can do is to use 
FunctionCatalog.withBuiltIns.getSqlOperatorTable() for listing Calcite 
built-in functions and TableEnvironment.listFunctions() for registered UDFs.


I hope this helps.

Regards,
Timo

On 25.01.19 at 10:17, Jeff Zhang wrote:
I believe it makes sense to list available UDFs programmatically, e.g.
users may want to see the available UDFs in the SQL client. It would also
benefit other downstream projects that use Flink SQL. Besides that, I
think Flink should also provide an API for querying the description of a
UDF, i.e. how to use it.


On Fri, Jan 25, 2019 at 5:12 PM yinhua.dai wrote:


Thanks Guys.
I was just wondering if there is another way other than hard-coding the list :)
Thanks anyway.



--
Sent from:
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/



--
Best Regards

Jeff Zhang





Re: Is there a way to get all flink build-in SQL functions

2019-01-25 Thread Jeff Zhang
I believe it makes sense to list available UDFs programmatically, e.g. users
may want to see the available UDFs in the SQL client. It would also benefit
other downstream projects that use Flink SQL. Besides that, I think Flink
should also provide an API for querying the description of a UDF, i.e. how to
use it.

On Fri, Jan 25, 2019 at 5:12 PM yinhua.dai wrote:

> Thanks Guys.
> I was just wondering if there is another way other than hard-coding the list :)
> Thanks anyway.
>
>
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>


-- 
Best Regards

Jeff Zhang


Re: [Flink 1.6] How to get current total number of processed events

2019-01-25 Thread Congxian Qiu
Hi, Nhan
There is only one way I know of to sum up across all the parallel operator
instances: set the parallelism to 1.

Best,
Congxian

On Fri, Jan 25, 2019 at 4:38 PM Thanh-Nhan Vo wrote:

> Hi Congxian Qiu,
>
> Thank you for your answer.
>
> If I understand well, each operator state is bound to one parallel
> operator instance.
> Indeed, I expect to get the total number across all parallel operator
> instances.
>
> Is there a way to sum up all these operator states, please?
>
> Best regard,
> Nhan
>
>
>
> *From:* Congxian Qiu [mailto:qcx978132...@gmail.com]
> *Sent:* Friday, January 25, 2019 7:30 AM
> *To:* Kien Truong 
> *Cc:* Thanh-Nhan Vo ; user@flink.apache.org
> *Subject:* Re: [Flink 1.6] How to get current total number of processed
> events
>
>
>
> Hi, Nhan
>
> Do you want the total number for the current parallel instance or for the
> whole operator? If you want the total number for the current parallel
> instance, is the operator state [1] sufficient for your use case?
>
>
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/state/state.html#operator-state
>
>
>
> On Thu, Jan 24, 2019 at 7:45 PM Kien Truong wrote:
>
> Hi Nhan,
>
> You can store the max/min value using the value states of a
> KeyedProcessFunction,
>
> or in the global state of a ProcessWindowFunction.
>
>
>
> On processing each item, compare its value to the current max/min and
> update the stored value as needed.
>
>
>
> Regards,
>
> Kien
>
>
>
> On 1/24/2019 12:37 AM, Thanh-Nhan Vo wrote:
>
> Hi Kien Truong,
>
>
> Thank you for your answer. I have another question, please !
> If I count the number of messages processed for a given key j (denoted
> c_j), is there a way to retrieve max{c_j}, min{c_j}?
>
> Thanks
>
>
>
> *From:* Kien Truong [mailto:duckientru...@gmail.com]
> *Sent:* Wednesday, January 23, 2019 4:04 PM
> *To:* user@flink.apache.org
> *Subject:* Re: [Flink 1.6] How to get current total number of processed
> events
>
>
>
> Hi Nhan,
>
> Logically, the total number of processed events before an event cannot be
> accurately calculated unless event processing is synchronized.
>
> This is not scalable, so naturally I don't think Flink supports it.
>
> Although, I suppose you can get an approximate count by using a non-keyed
> TumblingWindow, count the item inside the window, then use that value in
> the next window.
>
>
>
> Regards,
>
> Kien
>
>
>
> On 1/21/2019 9:34 PM, Thanh-Nhan Vo wrote:
>
> Hello all,
>
> I have a question, please !
> I’m using Flink 1.6 to process our data in streaming mode.
> I wonder if at a given event, there is a way to get the current total
> number of processed events (before this event).
>
> If possible, I want to get this total number of processed events as a
> value state in Keystream.
> It means that for a given key in KeyStream, I want to retrieve not only
> the total number of processed events for this key but also the total number
> of processed events for all keys.
>
> Is there a way to do this in Flink 1.6, please?
>
> Best regard,
> Nhan
>
>
>
>
>
>
> --
>
> Best,
>
> Congxian
>


-- 
Best,
Congxian


Re: Is there a way to get all flink build-in SQL functions

2019-01-25 Thread yinhua.dai
Thanks Guys.
I was just wondering if there is another way other than hard-coding the list :)
Thanks anyway.



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


RE: [Flink 1.6] How to get current total number of processed events

2019-01-25 Thread Thanh-Nhan Vo
Hi Congxian Qiu,

Thank you for your answer.
If I understand well, each operator state is bound to one parallel operator 
instance.
Indeed, I expect to get the total number across all parallel operator instances.
Is there a way to sum up all these operator states, please?

Best regard,
Nhan

From: Congxian Qiu [mailto:qcx978132...@gmail.com]
Sent: Friday, January 25, 2019 7:30 AM
To: Kien Truong 
Cc: Thanh-Nhan Vo ; user@flink.apache.org
Subject: Re: [Flink 1.6] How to get current total number of processed events

Hi, Nhan
Do you want the total number for the current parallel instance or for the whole
operator? If you want the total number for the current parallel instance, is
the operator state [1] sufficient for your use case?

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/state/state.html#operator-state

On Thu, Jan 24, 2019 at 7:45 PM Kien Truong wrote:

Hi Nhan,

You can store the max/min value using the value states of a 
KeyedProcessFunction,

or in the global state of a ProcessWindowFunction.



On processing each item, compare its value to the current max/min and update 
the stored value as needed.



Regards,

Kien


On 1/24/2019 12:37 AM, Thanh-Nhan Vo wrote:
Hi Kien Truong,

Thank you for your answer. I have another question, please !
If I count the number of messages processed for a given key j (denoted c_j), is 
there a way to retrieve max{c_j}, min{c_j}?

Thanks

From: Kien Truong [mailto:duckientru...@gmail.com]
Sent: Wednesday, January 23, 2019 4:04 PM
To: user@flink.apache.org
Subject: Re: [Flink 1.6] How to get current total number of processed events


Hi Nhan,

Logically, the total number of processed events before an event cannot be 
accurately calculated unless event processing is synchronized.

This is not scalable, so naturally I don't think Flink supports it.

Although, I suppose you can get an approximate count by using a non-keyed 
TumblingWindow, count the item inside the window, then use that value in the 
next window.



Regards,

Kien


On 1/21/2019 9:34 PM, Thanh-Nhan Vo wrote:
Hello all,

I have a question, please !
I’m using Flink 1.6 to process our data in streaming mode.
I wonder if at a given event, there is a way to get the current total number of 
processed events (before this event).
If possible, I want to get this total number of processed events as a value 
state in Keystream.
It means that for a given key in KeyStream, I want to retrieve not only the 
total number of processed events for this key but also the total number of 
processed events for all keys.

Is there a way to do this in Flink 1.6, please?
Best regard,
Nhan



--
Best,
Congxian


RE: [Flink 1.6] How to get current total number of processed events

2019-01-25 Thread Thanh-Nhan Vo
Hi Kien,

Thank you for your answer.

Please correct me if I'm wrong. If I understand well, if I store the max/min
value using the value states of a KeyedProcessFunction, this max/min value is
calculated per key?

Note that in my case, I expect that at every instant I can obtain the
maximum/minimum number of processed messages over all keys. For example:


- Input datastream: [message1(k1, v1), message2(k2, v2), message3(k1, v3),
  message4(k4, v4), message5(k1, v5), message6(k2, v6), message7(k7, v7)]


- When processing message7(k7, v7), I expect to obtain:

  o Maximum number of processed messages: 3 (corresponding to key k1)

  o Minimum number of processed messages: 1 (corresponding to keys k4 and k7)

Do you have any idea to obtain this, please?

Thank you so much !

Nhan
From: Kien Truong [mailto:duckientru...@gmail.com]
Sent: Thursday, January 24, 2019 12:45 PM
To: Thanh-Nhan Vo ; user@flink.apache.org
Subject: Re: [Flink 1.6] How to get current total number of processed events


Hi Nhan,

You can store the max/min value using the value states of a 
KeyedProcessFunction,

or in the global state of a ProcessWindowFunction.



On processing each item, compare its value to the current max/min and update 
the stored value as needed.



Regards,

Kien


On 1/24/2019 12:37 AM, Thanh-Nhan Vo wrote:
Hi Kien Truong,

Thank you for your answer. I have another question, please !
If I count the number of messages processed for a given key j (denoted c_j), is 
there a way to retrieve max{c_j}, min{c_j}?

Thanks

From: Kien Truong [mailto:duckientru...@gmail.com]
Sent: Wednesday, January 23, 2019 4:04 PM
To: user@flink.apache.org
Subject: Re: [Flink 1.6] How to get current total number of processed events


Hi Nhan,

Logically, the total number of processed events before an event cannot be 
accurately calculated unless event processing is synchronized.

This is not scalable, so naturally I don't think Flink supports it.

Although, I suppose you can get an approximate count by using a non-keyed 
TumblingWindow, count the item inside the window, then use that value in the 
next window.



Regards,

Kien


On 1/21/2019 9:34 PM, Thanh-Nhan Vo wrote:
Hello all,

I have a question, please !
I'm using Flink 1.6 to process our data in streaming mode.
I wonder if at a given event, there is a way to get the current total number of 
processed events (before this event).
If possible, I want to get this total number of processed events as a value 
state in Keystream.
It means that for a given key in KeyStream, I want to retrieve not only the 
total number of processed events for this key but also the total number of 
processed events for all keys.

Is there a way to do this in Flink 1.6, please?
Best regard,
Nhan



StreamingFileSink Avro batch size and compression

2019-01-25 Thread Taher Koitawala
Hi All,
 Is there a way to specify *batch size* and *compression* properties
when using StreamingFileSink just like we did in the bucketing sink? The only
parameters it accepts are the inactivity bucket check interval and the Avro
schema.

  We have numerous Flink jobs pulling data from the same Kafka
topics but performing different operations, and each Flink job writes
files of a different size. We would like to make this consistent.


Regards,
Taher Koitawala
GS Lab Pune
+91 8407979163