Re: How can I achieve 'sink.partition-commit.policy.kind'='metastore,success-file' with batch Hive sink?

2021-08-20 Thread Yik San Chan
Hi Jingsong, I have created a JIRA ticket https://issues.apache.org/jira/browse/FLINK-23891. Best, Yik San On Fri, Aug 20, 2021 at 3:32 PM Yik San Chan wrote: > Hi Caizhi, > > Thanks for the work around! It should work fine. > > Hi Jingsong, > > Thanks for the suggesti

Re: How can I achieve 'sink.partition-commit.policy.kind'='metastore,success-file' with batch Hive sink?

2021-08-20 Thread Yik San Chan
11:01 AM Caizhi Weng wrote: > > > > Hi! > > > > As far as I know Flink batch jobs will not add the _SUCCESS file. > However for batch jobs you can register a JobListener and add the _SUCCESS > file by yourself in JobListener#onJobExecuted. See registerJobListener >
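
For batch jobs, a simple alternative to the JobListener approach mentioned above is to write the marker yourself once the synchronous batch execution returns. The sketch below assumes the partition directory sits on a locally mounted filesystem (for HDFS you would use an HDFS client instead); table names and the path are placeholders.

```python
# Minimal sketch: create the _SUCCESS marker after the batch insert has completed.
from pathlib import Path

from pyflink.table import EnvironmentSettings, TableEnvironment

settings = EnvironmentSettings.new_instance().in_batch_mode().use_blink_planner().build()
t_env = TableEnvironment.create(settings)

# ... register the Hive catalog and tables here ...

# wait() blocks until the batch job has finished
t_env.execute_sql("INSERT INTO hive_sink SELECT * FROM src").wait()

# hypothetical partition directory of the table that was just written
partition_dir = Path("/warehouse/mydb.db/hive_sink/dt=2021-08-20")
(partition_dir / "_SUCCESS").touch()
```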

How can I achieve 'sink.partition-commit.policy.kind'='metastore,success-file' with batch Hive sink?

2021-08-19 Thread Yik San Chan
Hi community, According to the [docs]( https://ci.apache.org/projects/flink/flink-docs-master/docs/connectors/table/filesystem/#partition-commit-policy), if I create a Hive table with config sink.partition-commit.policy.kind="metastore,success-file", once the write to the **streaming** Hive sink i

Re: redis sink from flink

2021-08-16 Thread Yik San Chan
, 2021 at 10:15 AM Yik San Chan wrote: > Hi Jin, > > I was in the same shoes. I tried bahir redis connector at first, then I > felt it was very limited, so I rolled out my own. It was actually quite > straightforward. > > All you need to do is to extend RichSinkFunction, then pu

Re: redis sink from flink

2021-08-16 Thread Yik San Chan
Hi Jin, I was in the same shoes. I tried bahir redis connector at first, then I felt it was very limited, so I rolled out my own. It was actually quite straightforward. All you need to do is to extend RichSinkFunction, then put your logic inside. Regarding Redis clients, Jedis (https://github.com

Re: How to read large amount of data from hive and write to redis, in a batch manner?

2021-07-08 Thread Yik San Chan
everything in bufferedElements into Redis using Redis pipelining. That's pretty much it. Thank you for your help again! On Thu, Jul 8, 2021 at 11:09 PM Piotr Nowojski wrote: > Great, thanks for coming back and I'm glad that it works for you! > > Piotrek > > czw., 8 lip
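
The flush itself boils down to one pipelined round trip per batch. Below is a plain-Python sketch of that idea with the redis-py client (connection settings and batch size are made up); in the actual job this buffering lives inside the Flink sink function.

```python
# Buffer elements and flush them to Redis in one pipelined round trip per batch.
import redis

BATCH_SIZE = 10_000
client = redis.Redis(host="localhost", port=6379)
buffered_elements = []

def add(key, value):
    buffered_elements.append((key, value))
    if len(buffered_elements) >= BATCH_SIZE:
        flush()

def flush():
    pipe = client.pipeline(transaction=False)
    for key, value in buffered_elements:
        pipe.set(key, value)
    pipe.execute()  # sends all buffered commands at once
    buffered_elements.clear()
```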

My batch source doesn't emit MAX_WATERMARK when it finishes - why?

2021-07-08 Thread Yik San Chan
Hi, According to the docs [1] When a source reaches the end of the input, it emits a final watermark with timestamp Long.MAX_VALUE, indicating the "end of time". However, in my small experiment [2], the Flink job reads from a local csv file, and prints a watermark for each record in the SinkFun

Re: How to read large amount of data from hive and write to redis, in a batch manner?

2021-07-08 Thread Yik San Chan
nting and either always restart the job from scratch or > accept some occasional data loss. > > FYI, virtually every connector/sink is internally batching writes to some > extent. Usually by doing option 1. > > Piotrek > > wt., 25 maj 2021 o 14:50 Yik San Chan > napisał(a):

Re: How can I tell if a record in a bounded job is the last record?

2021-06-30 Thread Yik San Chan
gger all the timers when it terminates (see > [1]). > > [1] > https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java > > Best, > Paul Lam > > 2021年6月30日 10:38,Yik San Chan 写道: > > Hi co

How can I tell if a record in a bounded job is the last record?

2021-06-29 Thread Yik San Chan
Hi community, I have a batch job that consumes records from a bounded source (e.g., Hive), walk them through a BufferingSink as described in [docs]( https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/datastream/fault-tolerance/state/#checkpointedfunction). In the BufferingSink,

How to read large amount of data from hive and write to redis, in a batch manner?

2021-05-25 Thread Yik San Chan
Hi community, I have a Hive table that stores tens of millions rows of data. In my Flink job, I want to process the data in batch manner: - Split the data into batches, each batch has (maybe) 10,000 rows. - For each batch, call a batchPut() API on my redis client to dump in Redis. Doing so in a

When to prefer toDataStream over toAppendStream or toRetractStream?

2021-05-24 Thread Yik San Chan
Hi community, Flink 1.13 introduces toDataStream. However, I wonder when do we prefer toDataStream over toAppendStream or toRetractStream? Thank you! Best, Yik San

Re: Task not serializable when logging in a trait method

2021-05-24 Thread Yik San Chan
cted *static* final val LOG = LoggerFactory.getLogger(getClass) > > Best, > Guowei > > > On Mon, May 24, 2021 at 2:41 PM Yik San Chan > wrote: > >> Hi community, >> >> I have a job that consumes data from a datagen source, tries to log >> something in `map` oper

Task not serializable when logging in a trait method

2021-05-23 Thread Yik San Chan
Hi community, I have a job that consumes data from a datagen source, tries to log something in `map` operators, and sinks the result to a DiscardingSink. The full example can be found in [the repo]( https://github.com/YikSanChan/log-in-flink-operator). The `Job` extends `BaseJob` where `preproces

Re: PyFlink UDF: No match found for function signature XXX

2021-05-18 Thread Yik San Chan
s/overview/ Best, Yik San On Tue, May 18, 2021 at 6:43 PM Yik San Chan wrote: > Hi Dian, > > I changed the udf to: > > ```python > @udf( > input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()], > result_type=DataTypes.BIGINT(), > ) > def add(i, j): > return i +
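
Putting the fix together, a minimal sketch of the corrected UDF looks like the following; it assumes a registered table `mysource` with columns a BIGINT and b BIGINT, matching the schema from the thread.

```python
# The declared input/result types must match the table schema, otherwise SQL
# cannot resolve the call and reports "No match found for function signature".
from pyflink.table import DataTypes, EnvironmentSettings, TableEnvironment
from pyflink.table.udf import udf

@udf(input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()],
     result_type=DataTypes.BIGINT())
def add(i, j):
    return i + j

settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
t_env = TableEnvironment.create(settings)
t_env.create_temporary_function("add", add)

# t_env.sql_query("SELECT add(a, b) FROM mysource")
```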

Re: PyFlink UDF: No match found for function signature XXX

2021-05-18 Thread Yik San Chan
types for add are DataTypes.INT, however, the schema of > aiinfra.mysource is: a bigint and b bigint. > > Regards, > Dian > > 2021年5月18日 下午5:38,Yik San Chan 写道: > > Hi, > > I have a PyFlink script that fails to use a simple UDF. The full script > can be found belo

PyFlink UDF: No match found for function signature XXX

2021-05-18 Thread Yik San Chan
Hi, I have a PyFlink script that fails to use a simple UDF. The full script can be found below: ```python from pyflink.datastream import StreamExecutionEnvironment from pyflink.table import ( DataTypes, EnvironmentSettings, SqlDialect, StreamTableEnvironment, ) from pyflink.table.udf import udf

Re: Flink 1.13.0: check flamegraph on WebUI shows 404 not found

2021-05-12 Thread Yik San Chan
lamegraph feature in the configuration? > > On 5/12/2021 10:51 AM, Yik San Chan wrote: > > Hi community, > > > > Flink 1.13.0 releases flamegraph. However, when I run Flink 1.13.0 > > locally, and try to check the flamegraph of an operator that is > > running,

Flink 1.13.0: check flamegraph on WebUI shows 404 not found

2021-05-12 Thread Yik San Chan
Hi community, Flink 1.13.0 releases flamegraph. However, when I run Flink 1.13.0 locally, and try to check the flamegraph of an operator that is running, I got this "404 not found" error. This is the request and response. Request: curl ' http://localhost:8081/jobs/9f4ba3d39de6b2d9de0ea77ba8f8431a

Re: ValueError: unsupported pickle protocol: 5

2021-05-11 Thread Yik San Chan
Thank you! That answers my question. -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: how to split a column value into multiple rows in flink sql?

2021-05-09 Thread Yik San Chan
Hi, Maybe try row-based flatMap operation in table api. It is available on flink 1.13.x https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/table/tableapi/#flatmap Best, Yik San -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
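
The row-based flat_map mentioned above needs Flink 1.13; a sketch that also works through SQL is a table function (UDTF) joined with LATERAL TABLE. Table and column names below are made up for illustration.

```python
# Split a comma-separated column into multiple rows with a Python UDTF.
from pyflink.table import DataTypes
from pyflink.table.udf import udtf

@udtf(input_types=[DataTypes.STRING()], result_types=[DataTypes.STRING()])
def split(s):
    for part in s.split(","):
        yield part

# t_env.create_temporary_function("split", split)
# t_env.sql_query(
#     "SELECT id, word FROM my_table, LATERAL TABLE(split(tags)) AS T(word)"
# )
```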

Re: Does Top-N query have to be result-updating, even for batch input?

2021-05-08 Thread Yik San Chan
There turns out to be an easy solution. env_settings = EnvironmentSettings.new_instance().in_batch_mode().use_blink_planner().build() t_env = BatchTableEnvironment.create(environment_settings=env_settings) Use batch mode instead, then the top-n will only emit the result once, at the end of the
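
A fuller sketch of that batch-mode setup together with a standard Top-N query is shown below; the `scores` table and its columns are placeholders.

```python
# In batch mode the Top-N result is emitted once at the end instead of as a
# retract (result-updating) stream.
from pyflink.table import BatchTableEnvironment, EnvironmentSettings

env_settings = (
    EnvironmentSettings.new_instance().in_batch_mode().use_blink_planner().build()
)
t_env = BatchTableEnvironment.create(environment_settings=env_settings)

top_n_sql = """
SELECT category, item, score
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY category ORDER BY score DESC) AS rownum
    FROM scores
) WHERE rownum <= 3
"""
# t_env.execute_sql(top_n_sql).print()
```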

Does Top-N query have to be result-updating, even for batch input?

2021-05-08 Thread Yik San Chan
Hi community, Top-N query is result-updating, according to [docs]( https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/table/sql/queries/topn/ ): > The TopN query is *Result Updating*. Flink SQL will sort the input data stream according to the order key, so if the top N records

PyFlink local mode doesn't give a consistent output on top-k, while standalone mode does

2021-05-08 Thread Yik San Chan
Hi community, I have a job that exercises top-n query as described in https://ci.apache.org/projects/flink/flink-docs-release-1.13/zh/docs/dev/table/sql/queries/topn/. You can find the job in https://github.com/YikSanChan/pyflink-quickstart/blob/26e7c09aaa1167b13b981586ffb0e7d6bb6cf053/topk.py Wh

Re: I call Pandas UDF N times, do I have to initiate the UDF N times?

2021-05-07 Thread Yik San Chan
could optimize. Would you like to create a > ticket for this? > > Regards, > Dian > > 2021年5月8日 下午2:27,Yik San Chan 写道: > > Hi Dian, > > Thanks for pointing that out, it is a work around that I have also > considered. > > I wonder if there is a framework level

Re: I call Pandas UDF N times, do I have to initiate the UDF N times?

2021-05-07 Thread Yik San Chan
wrote: > Hi Yik San, > > Is it acceptable to rewrite the UDF a bit to accept multiple parameters > and then rewrite the program as following: > > ``` > SELECT > LABEL_ENCODE(a, b, c) > ... > ``` > > Regards, > Dian > > 2021年5月8日 上午11:56,Yik San Chan 写道:
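
A sketch of that workaround: fold the three columns into one Pandas UDF call so the expensive open() initialization runs once per UDF instance. The class below is illustrative; the mapping stands in for loading a real encoder, and for simplicity the three encoded values are packed into a single string column.

```python
# With func_type="pandas", each argument arrives as a pandas.Series and the
# result must be a Series of the same length.
from pyflink.table import DataTypes
from pyflink.table.udf import ScalarFunction, udf

class LabelEncode(ScalarFunction):
    def open(self, function_context):
        # expensive initialization now happens once, not once per column
        self.mapping = {"cat": "0", "dog": "1"}  # stand-in for a real encoder

    def eval(self, a, b, c):
        def encode(s):
            return s.map(self.mapping).fillna("-1")
        return encode(a) + "," + encode(b) + "," + encode(c)

label_encode = udf(
    LabelEncode(),
    input_types=[DataTypes.STRING(), DataTypes.STRING(), DataTypes.STRING()],
    result_type=DataTypes.STRING(),
    func_type="pandas",
)
# SELECT LABEL_ENCODE(a, b, c) FROM ...
```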

I call Pandas UDF N times, do I have to initiate the UDF N times?

2021-05-07 Thread Yik San Chan
Hi community, I am using PyFlink and Pandas UDF in my job. The job executes a SQL like this: ``` SELECT LABEL_ENCODE(a), LABEL_ENCODE(b), LABEL_ENCODE(c) ... ``` And my LABEL_ENCODE UDF is defined below: ``` class LabelEncode(ScalarFunction): def open(self, function_context): logging.inf

Re: How to increase the number of task managers?

2021-05-07 Thread Yik San Chan
; > > > parallelism.default is another config you should consider. > > > > Read also > > > https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/execution/parallel/ > > > > > > > > From: Yik

How to increase the number of task managers?

2021-05-07 Thread Yik San Chan
Hi community, According to the [docs]( https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/ ): > taskmanager.numberOfTaskSlots: The number of slots that a TaskManager offers *(default: 1)*. Each slot can take one task or pipeline. Having multiple slots in a TaskMan

Re: How to tell between a local mode run vs. remote mode run?

2021-05-05 Thread Yik San Chan
gateway.jvm.org.apache.flink.streaming.api.environment.LocalStreamEnvironment) > print(env_class == local_stream_environment_class) > > > if __name__ == '__main__': > test() > > ``` > > Yik San Chan 于2021年5月5日周三 下午12:04写道: > >> Hi, >> >&g
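
A cleaned-up version of that check is sketched below. It inspects the underlying Java environment class through a PyFlink-internal attribute, so treat it as a best-effort heuristic rather than a supported API.

```python
# Returns True when the job runs against a LocalStreamEnvironment (mini cluster).
from pyflink.datastream import StreamExecutionEnvironment

def is_local_run(env: StreamExecutionEnvironment) -> bool:
    j_env = env._j_stream_execution_environment  # internal attribute
    return j_env.getClass().getName().endswith("LocalStreamEnvironment")

env = StreamExecutionEnvironment.get_execution_environment()
print(is_local_run(env))
```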

How to tell between a local mode run vs. remote mode run?

2021-05-04 Thread Yik San Chan
Hi, According to https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/python/faq/ > When executing jobs in mini cluster(e.g. when executing jobs in IDE) ... please remember to explicitly wait for the job execution to finish as these APIs are asynchronous. I hope my program will

Re: [Announcement] Analytics Zoo 0.10.0 release

2021-04-28 Thread Yik San Chan
Hi Jason, Thanks for sharing. I looked up the term "Flink" on https://analytics-zoo.readthedocs.io/en/latest/ but it doesn't even appear there. Do you mind sharing how it relates to Flink users? Best, Yik San On Wed, Apr 28, 2021 at 10:48 PM Jason Dai wrote: > Hi Everyone, > > > I’m happy to announ


Re: ModuleNotFound when loading udf from another py file

2021-04-27 Thread Yik San Chan
Hi Dian, I follow up with this PR https://github.com/apache/flink/pull/15790 On Tue, Apr 27, 2021 at 11:03 PM Dian Fu wrote: > Hi Yik San, > > Make sense to me. :) > > Regards, > Dian > > 2021年4月27日 下午9:50,Yik San Chan 写道: > > Hi Dian, > > Wow, th

Re: ModuleNotFound when loading udf from another py file

2021-04-27 Thread Yik San Chan
ss the whole Python UDF implementation will be > serialized. However, for the previous case, only the path of the class is > serialized. > > Regards, > Dian > > 2021年4月27日 下午8:52,Yik San Chan 写道: > > Hi Dian, > > Thanks! Adding -pyfs definitely helps. > >

Re: ModuleNotFound when loading udf from another py file

2021-04-27 Thread Yik San Chan
s command line arguments. Otherwise, this > file will not be available during execution. > > Regards, > Dian > > 2021年4月27日 下午8:01,Yik San Chan 写道: > > Hi, > > Here's the reproducible code sample: > https://github.com/YikSanChan/pyflink-quickstart/tree/83526abc
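
The same effect as the -pyfs flag can be achieved programmatically; a short sketch follows (the file name is taken from the thread).

```python
# Ship the module that defines the UDF so it is importable on the worker side.
from pyflink.table import EnvironmentSettings, TableEnvironment

settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
t_env = TableEnvironment.create(settings)

t_env.add_python_file("decrypt_fun.py")
# after this, `from decrypt_fun import decrypt` also resolves during execution
```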

ModuleNotFound when loading udf from another py file

2021-04-27 Thread Yik San Chan
Hi, Here's the reproducible code sample: https://github.com/YikSanChan/pyflink-quickstart/tree/83526abca832f9ed5b8ce20be52fd506c45044d3 I implement my Python UDF by extending the ScalarFunction base class in a separate file named decrypt_fun.py, and try to import the udf into my main python file

Re: How to load resource in a PyFlink UDF

2021-04-27 Thread Yik San Chan
urces.zip, -pyarch makes more sense than -pyfs. > > Regards, > Dian > > 2021年4月27日 下午5:14,Yik San Chan 写道: > > Hi Dian, > > Thank you! That solves my question. By the way, for my use case, does > -pyarch make more sense than -pyfs? > > Best, > Yik San > >
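
A sketch of the -pyarch route discussed above (add_python_archive is the programmatic equivalent): the archive is extracted on each worker under the given target directory, so the UDF can keep using the relative path.

```python
# resources.zip is assumed to contain crypt.csv at its root; it is extracted
# into a directory named "resources" on every worker.
from pyflink.table import EnvironmentSettings, TableEnvironment

settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
t_env = TableEnvironment.create(settings)

t_env.add_python_archive("resources.zip", "resources")
# inside the UDF: pd.read_csv('resources/crypt.csv', ...)
```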

Re: Confusing docs on python.archives

2021-04-27 Thread Yik San Chan
://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/python/table-api-users-guide/dependency_management.html > > 2021年4月27日 上午8:17,Yik San Chan 写道: > > Hi Dian, > > I wonder where can we specify the target directory? > > Best, > Yik San > > On Mon, Apr 26, 202

Re: How to load resource in a PyFlink UDF

2021-04-27 Thread Yik San Chan
> Dian > > 2021年4月27日 下午4:39,Yik San Chan 写道: > > Hi, > > My UDF has the dependency to a resource file named crypt.csv that is > located in resources/ directory. > > ```python > # udf_use_resource.py > @udf(input_types=[DataTypes.STRING()], result_type=DataType

How to load resource in a PyFlink UDF

2021-04-27 Thread Yik San Chan
Hi, My UDF has the dependency to a resource file named crypt.csv that is located in resources/ directory. ```python # udf_use_resource.py @udf(input_types=[DataTypes.STRING()], result_type=DataTypes.STRING()) def decrypt(s): import pandas as pd d = pd.read_csv('resources/crypt.csv', header=None,

Re: Contradictory docs: python.files config can include not only python files

2021-04-27 Thread Yik San Chan
Hi Dian, I created a PR to fix the docs. https://github.com/apache/flink/pull/15779 On Tue, Apr 27, 2021 at 2:08 PM Dian Fu wrote: > Thanks for the suggestion. It makes sense to me~. > > 2021年4月27日 上午10:28,Yik San Chan 写道: > > Hi Dian, > > If that's the case, shal

Re: Contradictory docs: python.files config can include not only python files

2021-04-26 Thread Yik San Chan
he files which could be put in the PYTHONPATH are allowed here, e.g. > .zip, .whl, etc. > > Regards, > Dian > > 2021年4月27日 上午8:16,Yik San Chan 写道: > > Hi Dian, > > It is still not clear to me - does it only allow Python files (.py), or > not? > > Best, > Yik San

Re: Confusing docs on python.archives

2021-04-26 Thread Yik San Chan
rectory with > the specified name.` > > Regards, > Dian > > 2021年4月26日 下午8:57,Yik San Chan 写道: > > Hi community, > > In > https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/python_config.html#python-options > , > > > For each archive file, a

Re: Contradictory docs: python.files config can include not only python files

2021-04-26 Thread Yik San Chan
not able to perform such kind of check. > Do you have any advice for this? > > Regards, > Dian > > 2021年4月26日 下午8:45,Yik San Chan 写道: > > Hi community, > > In > https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/python_config.html, > regarding pyth

Confusing docs on python.archives

2021-04-26 Thread Yik San Chan
Hi community, In https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/python_config.html#python-options , > For each archive file, a target directory is specified. If the target directory name is specified, the archive file will be extracted to a name can directory with the specified

Contradictory docs: python.files config can include not only python files

2021-04-26 Thread Yik San Chan
Hi community, In https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/python_config.html, regarding python.files: > Attach custom python files for job. This makes readers think only Python files are allowed here. However, in https://ci.apache.org/projects/flink/flink-docs-stable/dep

PyFlink: Shall we disallow relative URL for filesystem path?

2021-04-26 Thread Yik San Chan
Hi community, When using Filesystem SQL Connector, users need to provide a path. When running a PyFlink job using the mini-cluster mode by simply `python WordCount.py`, the path can be a relative path, such as, `words.txt`. However, trying to submit the job to `flink run` will fail without questio

Flink Hive connector: hive-conf-dir supports hdfs URI, while hadoop-conf-dir supports local path only?

2021-04-26 Thread Yik San Chan
Hi community, This question is cross-posted on Stack Overflow https://stackoverflow.com/questions/67264156/flink-hive-connector-hive-conf-dir-supports-hdfs-uri-while-hadoop-conf-dir-sup In my current setup, local dev env can access testing env. I would like to run Flink job on local dev env, whil

Re: Flink: Not able to sink a stream into csv

2021-04-21 Thread Yik San Chan
ITH ( > 'connector' = 'filesystem', > 'format' = 'csv', > 'path' = '/tmp/output', > 'sink.rolling-policy.rollover-interval' = '10s' > ) > """) > >
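
For reference, the DDL quoted above consolidates into the following sketch; the column names are placeholders, while the connector options and rollover interval are the values from the thread.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
t_env = TableEnvironment.create(settings)

t_env.execute_sql("""
    CREATE TABLE sink (
        id BIGINT,
        data STRING
    ) WITH (
        'connector' = 'filesystem',
        'format' = 'csv',
        'path' = '/tmp/output',
        'sink.rolling-policy.rollover-interval' = '10s'
    )
""")
```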

Flink: Not able to sink a stream into csv

2021-04-21 Thread Yik San Chan
The question is cross posted on Stack Overflow https://stackoverflow.com/questions/67195207/flink-not-able-to-sink-a-stream-into-csv . I am trying to sink a stream into filesystem in csv format using PyFlink, however it does not work. ```python # stream_to_csv.py from pyflink.table import Environ

Re: PyFlink UDF: When to use vectorized vs scalar

2021-04-19 Thread Yik San Chan
> Dian > > 2021年4月19日 下午8:23,Yik San Chan 写道: > > Hi Dian, > > By "access data at row basis", do you mean, for input X, > > for row in X: > doSomething(row) > > If that's the case, I believe I am not accessing the vector like

Re: PyFlink UDF: When to use vectorized vs scalar

2021-04-19 Thread Yik San Chan
e ith row. > > Regards, > Dian > > 2021年4月19日 下午4:40,Yik San Chan 写道: > > Hmm one more question - as I said, there are 2 gains from using pandas UDF > - (1) smaller ser-de and invocation overhead, and (2) vector calculation. > > (2) depends on use cases, how about (1)

Re: Shall we add an option to ignore the header when flink sql consume filesystem csv source?

2021-04-19 Thread Yik San Chan
Hi Jin, Look forward to the implementation! Best, Yik San On Mon, Apr 19, 2021 at 5:58 PM JIN FENG wrote: > Hi, Yik San > > Yes, I'll implement this soon. > > On Mon, Apr 19, 2021 at 5:56 PM Yik San Chan > wrote: > >> Hi Jinfeng, >> >> Thanks for

Re: Shall we add an option to ignore the header when flink sql consume filesystem csv source?

2021-04-19 Thread Yik San Chan
Apr 19, 2021 at 5:26 PM Yik San Chan > wrote: > >> Hi community, >> >> According to >> https://stackoverflow.com/questions/65359382/apache-flink-sql-reference-guide-for-table-properties#comment115663505_65404207, >> there is no way to ignore csv header when f

Shall we add an option to ignore the header when flink sql consume filesystem csv source?

2021-04-19 Thread Yik San Chan
Hi community, According to https://stackoverflow.com/questions/65359382/apache-flink-sql-reference-guide-for-table-properties#comment115663505_65404207, there is no way to ignore csv header when flink sql consumes filesystem csv source. I think this is a reasonable behavior that users want. Shall

Re: PyFlink UDF: When to use vectorized vs scalar

2021-04-19 Thread Yik San Chan
Hmm one more question - as I said, there are 2 gains from using pandas UDF - (1) smaller ser-de and invocation overhead, and (2) vector calculation. (2) depends on use cases, how about (1)? Is the benefit (1) always-true? Best, Yik San On Mon, Apr 19, 2021 at 4:33 PM Yik San Chan wrote: >

Re: PyFlink UDF: When to use vectorized vs scalar

2021-04-19 Thread Yik San Chan
on that. > > I am ccing Dian Fu who is more familiar with pyflink > > Best, > Fabian > > On 16. Apr 2021, at 11:04, Yik San Chan wrote: > > The question is cross-posted on Stack Overflow > https://stackoverflow.com/questions/67122265/pyflink-udf-when-to-use-vecto

PyFlink UDF: When to use vectorized vs scalar

2021-04-16 Thread Yik San Chan
The question is cross-posted on Stack Overflow https://stackoverflow.com/questions/67122265/pyflink-udf-when-to-use-vectorized-vs-scalar Is there a simple set of rules to follow when deciding between vectorized vs scalar PyFlink UDF? According to [docs]( https://ci.apache.org/projects/flink/flink
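
For a concrete side-by-side, here is the same addition written as a scalar UDF and as a vectorized (Pandas) UDF; this is only a sketch to show how the inputs differ.

```python
from pyflink.table import DataTypes
from pyflink.table.udf import udf

@udf(input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()],
     result_type=DataTypes.BIGINT())
def add_scalar(i, j):
    # called once per row; i and j are plain Python values
    return i + j

@udf(input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()],
     result_type=DataTypes.BIGINT(),
     func_type="pandas")
def add_vectorized(i, j):
    # called once per batch; i and j are pandas.Series, which is where the
    # smaller ser-de/invocation overhead and the vectorized computation come from
    return i + j
```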

Re: PyFlink: called already closed and NullPointerException

2021-04-15 Thread Yik San Chan
,中国,江西,2,1 > > You need to remove the double quote of “iPhone9,1". > > Definitely, we should improve the error message. I guess this is caused of > the same reason as the previous NPE issue and it should be addressed in > https://issues.apache.org/jira/browse/FLINK-22297. >

Re: PyFlink Vectorized UDF throws NullPointerException

2021-04-15 Thread Yik San Chan
ing up. > > 2021年4月16日 上午7:10,Yik San Chan 写道: > > Hi Dian, > > I wonder if we can improve the error tracing and message so that it > becomes more obvious where the problem is? To me, a NPE really says very > little. > > Best, > Yik San > > On Thu, Apr 15, 2021 a

PyFlink: called already closed and NullPointerException

2021-04-15 Thread Yik San Chan
The question is cross-posted on Stack Overflow https://stackoverflow.com/questions/67118743/pyflink-called-already-closed-and-nullpointerexception . Hi community, I run into an issue where a PyFlink job may end up with 3 very different outcomes, given very slight difference in input, and luck :(

Re: PyFlink Vectorized UDF throws NullPointerException

2021-04-15 Thread Yik San Chan
Hi Dian, I wonder if we can improve the error tracing and message so that it becomes more obvious where the problem is? To me, a NPE really says very little. Best, Yik San On Thu, Apr 15, 2021 at 11:07 AM Dian Fu wrote: > Great! Thanks for letting me know~ > > 2021年4月15日 上午11:01,Yik

Re: PyFlink Vectorized UDF throws NullPointerException

2021-04-14 Thread Yik San Chan
andas UDF, the input type for each input argument is Pandas.Series > and the result type should also be a Pandas.Series. Besides, the length of > the result should be the same as the inputs. Could you check if this is the > case for your Pandas UDF implementation? > > Regards, > Dian
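
A minimal sketch of that constraint, with a stand-in computation in place of the real model from the original question: every input is a pandas.Series and the return value must be a pandas.Series of the same length.

```python
import numpy as np
import pandas as pd

from pyflink.table import DataTypes
from pyflink.table.udf import udf

@udf(input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()],
     result_type=DataTypes.DOUBLE(),
     func_type="pandas")
def predict(users, items):
    # stand-in for model.predict(...); any numpy-in/numpy-out computation
    # works the same way
    predictions = np.sqrt(users.to_numpy() * items.to_numpy())
    # wrap the result back into a Series whose length matches the input
    return pd.Series(predictions)
```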

PyFlink Vectorized UDF throws NullPointerException

2021-04-14 Thread Yik San Chan
The question is cross-posted on Stack Overflow https://stackoverflow.com/questions/67092978/pyflink-vectorized-udf-throws-nullpointerexception . I have a ML model that takes two numpy.ndarray - `users` and `items` - and returns an numpy.ndarray `predictions`. In normal Python code, I would do: ``

Re: JSON source for pyflink stream

2021-04-13 Thread Yik San Chan
Hi Giacomo, I think you can try using the Flink SQL connector. For JSON input such as {"a": 1, "b": {"c": 2, "d": 3}}, you can do: CREATE TABLE data ( a INT, b ROW<c INT, d INT> ) WITH (...) Let me know if that helps. Best, Yik San On Wed, Apr 14, 2021 at 2:00 AM wrote: > Hi, > I'm new to Flink and I a
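
A sketch of that DDL, assuming a Kafka source with the JSON format (topic, bootstrap servers, and group id are placeholders):

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
t_env = TableEnvironment.create(settings)

t_env.execute_sql("""
    CREATE TABLE data (
        a INT,
        b ROW<c INT, d INT>
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'my_topic',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'my_group',
        'format' = 'json'
    )
""")
# nested JSON fields are then addressed as b.c and b.d in queries
```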

Re: Why does flink-quickstart-scala suggests adding connector dependencies in the default scope, while Flink Hive integration docs suggest the opposite

2021-04-09 Thread Yik San Chan
he content in the lib directory, then you have to > restart the cluster. > > Cheers, > Till > > On Fri, Apr 9, 2021 at 4:02 AM Yik San Chan > wrote: > >> Hi Till, I have 2 follow-ups. >> >> (1) Why is Hive special, while for connectors such as kafka, the docs

Re: Why does flink-quickstart-scala suggests adding connector dependencies in the default scope, while Flink Hive integration docs suggest the opposite

2021-04-08 Thread Yik San Chan
dependency in the > user jar which reduces its size. However, there shouldn't be anything > preventing you from bundling the Hive dependency with your user code if you > want to. > > [1] > https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/hive/#depen

Why does flink-quickstart-scala suggests adding connector dependencies in the default scope, while Flink Hive integration docs suggest the opposite

2021-04-08 Thread Yik San Chan
The question is cross-posted on Stack Overflow https://stackoverflow.com/questions/67001326/why-does-flink-quickstart-scala-suggests-adding-connector-dependencies-in-the-de . ## Connector dependencies should be in default scope This is what [flink-quickstart-scala]( https://github.com/apache/flin

Re: Will env.execute() and executeInsert() submit 2 jobs?

2021-04-07 Thread Yik San Chan
I believe you should delete it. You can remove that call and see whether it gets rid of the extra job. On Wed, Apr 7, 2021 at 5:02 PM 王 浩成 wrote: > My program firstly did something on a data stream using DataStream API, > and then I converted it into a table and inserted it into another table for > sink, at th

Flink: Exception from container-launch exitCode=2

2021-04-06 Thread Yik San Chan
*The question is cross-posted on Stack Overflow https://stackoverflow.com/questions/66968180/flink-exception-from-container-launch-exitcode-2 . Viewing the question on Stack Overflow is preferred as I in

Re: Why is Hive dependency flink-sql-connector-hive not available on Maven Central?

2021-04-06 Thread Yik San Chan
-connector-hive-2.3.6_2.12:1.12.2* > > On Tue, Apr 6, 2021 at 4:10 PM Yik San Chan > wrote: > >> Hi, >> >> I am able to find the jar from Maven central. See updates in the >> StackOverflow post. >> >> Thank you! >> >> Best, >> Yik Sa

Re: Why is Hive dependency flink-sql-connector-hive not available on Maven Central?

2021-04-06 Thread Yik San Chan
ely maintains the hive connectors. > > Cheers, > Gordon > > > On Fri, Apr 2, 2021 at 11:36 AM Yik San Chan > wrote: > >> The question is cross-posted in StackOverflow >> https://stackoverflow.com/questions/66914119/flink-why-is-hive-dependency-flink-sql-connector-hive-no

Why is Hive dependency flink-sql-connector-hive not available on Maven Central?

2021-04-01 Thread Yik San Chan
The question is cross-posted in StackOverflow https://stackoverflow.com/questions/66914119/flink-why-is-hive-dependency-flink-sql-connector-hive-not-available-on-maven-ce According to [Flink SQL Hive: Using bundled hive jar]( https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connect

Re: Flink Table to DataStream: how to access column name?

2021-03-31 Thread Yik San Chan
ausing confusion here. I fear that > you have to either use 1.13-SNAPSHOT or wait for the 1.13 release which > should happen in a couple of weeks if you really need this feature. > > [1] https://issues.apache.org/jira/browse/FLINK-19981 > > Cheers, > Till > > On Tue, Mar

Re: Flink Table to DataStream: how to access column name?

2021-03-30 Thread Yik San Chan
31, 2021 at 12:17 AM Till Rohrmann wrote: > There is a method Row.getFieldNames. > > Cheers, > Till > > On Tue, Mar 30, 2021 at 6:06 PM Yik San Chan > wrote: > >> Hi Till, >> >> I look inside the Row class, it does contain a member `private final >>

Re: Flink Table to DataStream: how to access column name?

2021-03-30 Thread Yik San Chan
uple3 you effectively lose the information > about the column names. You could also call `toRetractStream[Row]` which > will give you a `DataStream[Row]` where you keep the column names. > > Cheers, > Till > > On Tue, Mar 30, 2021 at 3:52 PM Yik San Chan > wrote: > >

Flink Table to DataStream: how to access column name?

2021-03-30 Thread Yik San Chan
The question is cross-posted on Stack Overflow https://stackoverflow.com/questions/66872184/flink-table-to-datastream-how-to-access-column-name . I want to consume a Kafka topic into a table using Flink SQL, then convert it back to a DataStream. Here is the `SOURCE_DDL`: ``` CREATE TABLE kafka_s

Could not find a suitable table factory for 'org.apache.flink.table.factories.CatalogFactory' in the classpath

2021-03-26 Thread Yik San Chan
This question is cross-posted on Stack Overflow https://stackoverflow.com/questions/66815572/could-not-find-a-suitable-table-factory-for-org-apache-flink-table-factories-ca . I am running a PyFlink program that reads from Hive `mysource` table, does some processing, then writes to Hive `mysink` ta

Re: Flink SQL client: SELECT 'hello world' throws [ERROR] Could not execute SQL statement

2021-03-26 Thread Yik San Chan
started > [2] > https://ci.apache.org/projects/flink/flink-docs-stable/deployment/resource-providers/standalone > > Regards, > David > > On Fri, Mar 26, 2021 at 9:50 AM Yik San Chan > wrote: > >> The question is cross-posted in Stack Overflow >> https://stac

Flink SQL client: SELECT 'hello world' throws [ERROR] Could not execute SQL statement

2021-03-26 Thread Yik San Chan
The question is cross-posted in Stack Overflow https://stackoverflow.com/questions/66813644/flink-sql-client-select-hello-world-throws-error-could-not-execute-sql-stat . I am following Flink SQL client docs https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/sqlClient.html#dependencie

Re: flink sql jmh failure

2021-03-24 Thread Yik San Chan
Hi Jie, I am curious what library do you use to get the ClickHouseTableBuilder On Wed, Mar 24, 2021 at 8:41 PM jie mei wrote: > Hi, Community > > I run a jmh benchmark task get blew error, which use flink sql consuming > data from data-gen connector(10_000_000) and write data to clickhouse. ble

Re: Failed to unit test PyFlink UDF

2021-03-23 Thread Yik San Chan
Hi Dian, Thanks for your patience on all these asks! Best, Yik San On Wed, Mar 24, 2021 at 10:32 AM Dian Fu wrote: > It’s a good advice. I have created ticket > https://issues.apache.org/jira/browse/FLINK-21938 to track this. > > 2021年3月24日 上午10:24,Yik San Chan 写道: > > H

Re: Failed to unit test PyFlink UDF

2021-03-23 Thread Yik San Chan
24, 2021 at 10:21 AM Dian Fu wrote: > As I replied in previous email, it doesn’t block users to write tests for > PyFlink UDFs. Users could use ._func to access the original Python function > if they want. > > Regards, > Dian > > 2021年3月23日 下午2:39,Yik San Chan 写道: > &

Re: Failed to unit test PyFlink UDF

2021-03-22 Thread Yik San Chan
? Best, Yik San On Tue, Mar 23, 2021 at 2:19 PM Dian Fu wrote: > Hi Yik San, > > This field isn't expected to be exposed to users and so I'm not convinced > that we should add such an interface/method in Flink. > > Regards, > Dian > > On Tue, Mar 23, 2021 at 2

Re: Failed to unit test PyFlink UDF

2021-03-22 Thread Yik San Chan
Hi Dian, The ._func method seems to be internal only. Maybe we can add some public-facing method to make it more intuitive for use in unit test? What do you think? Best, Yik San On Tue, Mar 23, 2021 at 2:02 PM Yik San Chan wrote: > Hi Dian, > > Thanks! It solves my problem. > &

Re: Failed to unit test PyFlink UDF

2021-03-22 Thread Yik San Chan
`, you will see the output is something like " 'pyflink.table.expression.Expression'>". > > You could try the following code: assert add._func(1, 1) == 3 > > add._func returns the original Python function. > > Regards, > Dian > > On Tue, Mar 23, 202
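
A minimal pytest-style sketch of that approach, using a simple add UDF for illustration: the wrapped Python function is reached through the internal ._func attribute, bypassing the Flink runtime entirely.

```python
from pyflink.table import DataTypes
from pyflink.table.udf import udf

@udf(input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()],
     result_type=DataTypes.BIGINT())
def add(i, j):
    return i + j

def test_add():
    # test the plain Python logic without starting a Flink job
    assert add._func(1, 2) == 3
```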

Failed to unit test PyFlink UDF

2021-03-22 Thread Yik San Chan
(This question is cross-posted on StackOverflow https://stackoverflow.com/questions/66756612/failed-to-unit-test-pyflink-udf ) I am using PyFlink and I want to unit test my UDF written in Python. To test the simple udf below: ```python # tasks/helloworld/udf.py from pyflink.table import DataType

Re: PyFlink java.io.EOFException at java.io.DataInputStream.readInt

2021-03-19 Thread Yik San Chan
e created https://issues.apache.org/jira/browse/FLINK-21876 to follow > up with this issue. > > Regards, > Dian > > 2021年3月19日 下午6:57,Xingbo Huang 写道: > > Yes, you need to ensure that the key and value types of the Map are > determined > > Best, > Xingbo > > Yik S

Re: PyFlink java.io.EOFException at java.io.DataInputStream.readInt

2021-03-19 Thread Yik San Chan
I got why regarding the simplified question - the dummy parser should return key(string)-value(string), otherwise it fails the result_type spec On Fri, Mar 19, 2021 at 3:37 PM Yik San Chan wrote: > Hi Dian, > > I simplify the question in > https://stackoverflow.com/questions/6668
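
In other words, the declared result type has to match what the function actually returns. A sketch, using a dummy parser with MAP<STRING, STRING> as in the thread:

```python
from pyflink.table import DataTypes
from pyflink.table.udf import udf

@udf(input_types=[DataTypes.STRING()],
     result_type=DataTypes.MAP(DataTypes.STRING(), DataTypes.STRING()))
def parse(payload):
    # every key and value must be a str to match the declared MAP<STRING, STRING>
    return {"raw": payload, "length": str(len(payload))}
```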

Re: PyFlink java.io.EOFException at java.io.DataInputStream.readInt

2021-03-19 Thread Yik San Chan
e.tasks.StreamTask.invoke(StreamTask.java:589) at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:755) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:570) at java.lang.Thread.run(Thread.java:748) ``` The issue is probably related to the udf. Any help? Thanks! Best, Yik San

Re: PyFlink java.io.EOFException at java.io.DataInputStream.readInt

2021-03-18 Thread Yik San Chan
Hi Dian, I am able to reproduce this issue in a much simpler setup. Let me update with the simpler reproducible example shortly. Best, Yik San On Fri, Mar 19, 2021 at 11:28 AM Yik San Chan wrote: > Hi Dian, > > It is a good catch, though after changing to use > flink-sql-connecto

Re: PyFlink java.io.EOFException at java.io.DataInputStream.readInt

2021-03-18 Thread Yik San Chan
e jar files in the cluster nodes are also built with Scala 2.12? PyFlink > package bundles jar files with Scala 2.11 by default. I’m still not sure if > it’s related to this issue. However, I think this is problematic. Could you > make sure that they are consistent? > > > 202

Re: PyFlink java.io.EOFException at java.io.DataInputStream.readInt

2021-03-18 Thread Yik San Chan
d matches the > PyFlink version. > > Regards, > Dian > > 2021年3月18日 下午5:01,Yik San Chan 写道: > > This question is cross-posted on StackOverflow > https://stackoverflow.com/questions/66687797/pyflink-java-io-eofexception-at-java-io-datainputstream-readint > > I have

PyFlink java.io.EOFException at java.io.DataInputStream.readInt

2021-03-18 Thread Yik San Chan
This question is cross-posted on StackOverflow https://stackoverflow.com/questions/66687797/pyflink-java-io-eofexception-at-java-io-datainputstream-readint I have a PyFlink job that reads from Kafka source, transform, and write to Kafka sink. This is a `tree` view of my working directory. ``` > t

Re: Why does Flink FileSystem sink splits into multiple files

2021-03-15 Thread Yik San Chan
Thank you, it works. Best, Yik San Chan On Mon, Mar 15, 2021 at 5:30 PM David Anderson wrote: > The first time you ran it without having specified the parallelism, and so > you got the default parallelism -- which is greater than 1 (probably 4 or > 8, depending on how many cores your
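
For reference, a sketch of the fix David describes: with the default parallelism (greater than 1) the filesystem sink writes one file per subtask, so forcing parallelism 1 produces a single output file.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

settings = EnvironmentSettings.new_instance().in_batch_mode().use_blink_planner().build()
t_env = TableEnvironment.create(settings)

# write the whole result with a single subtask -> a single output file
t_env.get_config().get_configuration().set_string("parallelism.default", "1")
```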

Re: Got “pyflink.util.exceptions.TableException: findAndCreateTableSource failed.” when running PyFlink example

2021-03-15 Thread Yik San Chan
Thanks for your help, it works. Best, Yik San Chan On Tue, Mar 16, 2021 at 10:03 AM Xingbo Huang wrote: > Hi, > > The problem is that the legacy DataSet you are using does not support the > FileSystem connector you declared. You can use blink Planner to achieve

Re: flink参数问题

2021-03-15 Thread Yik San Chan
Hi lxk, Can you please translate the question to English, and provide more info so that people can help? Thanks. On Mon, Mar 15, 2021 at 2:20 PM lxk7...@163.com wrote: > > 大佬们,我现在flink的版本是flink 1.10,但是我通过-ynm 指定yarn上的任务名称不起作用,一直显示的是Flink per-job > cluster > > -- > l

Why does Flink FileSystem sink splits into multiple files

2021-03-15 Thread Yik San Chan
.field('count', DataTypes.BIGINT())) \ .with_schema(Schema() .field('word', DataTypes.STRING()) .field('count', DataTypes.BIGINT())) \ .create_temporary_table('mySink') tab = t_env.from_path('mySource') tab.group_by(tab.word) \ .select(tab.word, lit(1).count) \ .execute_insert('mySink').wait() ``` Running this version will generate a /tmp/output. Note it doesn't come with comma delimiter. ``` > cat /tmp/output flink 2 pyflink 1 ``` Any idea why? Thanks! Best, Yik San Chan

Got “pyflink.util.exceptions.TableException: findAndCreateTableSource failed.” when running PyFlink example

2021-03-14 Thread Yik San Chan
(The question is cross-posted on StackOverflow https://stackoverflow.com/questions/66632765/got-pyflink-util-exceptions-tableexception-findandcreatetablesource-failed-w ) I am running below PyFlink program (copied from https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/python/table_a

Can I use PyFlink together with PyTorch/Tensorflow/PyTorch

2021-03-14 Thread Yik San Chan
Hi community, I am exploring PyFlink and I wonder if it is possible to use PyFlink together with all these ML libs that ML engineers normally use: PyTorch, Tensorflow, Scikit Learn, Xgboost, LightGBM, etc. According to this SO thread
