Re: using thin jar to replace fat jar on yarn cluster mode

2019-12-22, posted by Rui Li
Hi,

I think you can try specifying dependent jars with the -C option[1] when
you submit the job, and see if that meets your needs.

[1]
https://ci.apache.org/projects/flink/flink-docs-stable/ops/cli.html#usage
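As a sketch of how this could look (the paths and jar names below are placeholders, and -C expects a URL reachable from every node in the cluster):

```shell
# Submit a thin job jar; -C adds each dependency jar to the classpath
# of the client and all cluster nodes. The file:// paths here are
# hypothetical -- they must exist on every node.
flink run \
  -m yarn-cluster \
  -C file:///opt/flink-deps/dependency-a.jar \
  -C file:///opt/flink-deps/dependency-b.jar \
  ./my-thin-job.jar
```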

On Mon, Dec 23, 2019 at 10:09 AM zjfpla...@hotmail.com <
zjfpla...@hotmail.com> wrote:

> Hi,
> Does Flink on YARN support using a thin jar instead of a fat jar?
> I don't want each Flink task's jar to be hundreds of MB. I'd like to
> put all the dependency packages in a single directory, so that each
> task jar is only tens of KB.
>
> --
> zjfpla...@hotmail.com
>


-- 
Best regards!
Rui Li


Re: Flink On K8s, build docker image very slowly, is there some way to make it faster?

2019-12-22, posted by vino yang
Hi Lake,

Can you identify which steps are taking a long time?

Best,
Vino

LakeShen wrote on Mon, Dec 23, 2019 at 2:46 PM:

> Hi community, when I run a Flink task on K8s, the first step is to
> build the Flink task jar into a Docker image. I find that building the
> Docker image takes a long time. Is there some way to make it faster?
> Thanks for your reply.
>


Re: Flink On K8s, build docker image very slowly, is there some way to make it faster?

2019-12-22, posted by Xintong Song
Hi Lake,

Usually building a docker image should not take much time (typically less
than 2 minutes).

It is probably a network issue that causes the long image build times. Of
course we would need more information (e.g., logs) to confirm that, but in
our experience, pulling the base image (on which the Flink on K8s image is
built) from DockerHub can take quite some time from mainland China (where I
assume you are, since you're also writing to user-zh).

If this is indeed the case, you can try modifying
"flink-container/docker/Dockerfile": change the line "FROM
openjdk:8-jre-alpine" to point to a domestic or local image mirror.
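For example, the change might look like this (the registry host below is a placeholder; substitute whichever mirror is reachable for you):

```dockerfile
# flink-container/docker/Dockerfile
# Original line:
#   FROM openjdk:8-jre-alpine
# Replaced with the same image pulled from a mirror (hypothetical host):
FROM registry.example.com/library/openjdk:8-jre-alpine
```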

Thank you~

Xintong Song



On Mon, Dec 23, 2019 at 2:46 PM LakeShen  wrote:

> Hi community, when I run a Flink task on K8s, the first step is to
> build the Flink task jar into a Docker image. I find that building the
> Docker image takes a long time. Is there some way to make it faster?
> Thanks for your reply.
>


Flink On K8s, build docker image very slowly, is there some way to make it faster?

2019-12-22, posted by LakeShen
Hi community, when I run a Flink task on K8s, the first step is to build
the Flink task jar into a Docker image. I find that building the Docker
image takes a long time. Is there some way to make it faster?
Thanks for your reply.


Re: Using Flink for dimension table joins

2019-12-22, posted by LakeShen
Flink 1.9 SQL supports HBase as a dimension table, though without caching:
each incoming record triggers a lookup against HBase. We use HBase this
way ourselves and it keeps up with our query volume.
That should cover most of the common scenarios.
Of course, if your company uses other storage systems, you can define a
custom dimension table in SQL; see LookupableTableSource for details.

Best wishes,
LakeShen

lucas.wu wrote on Fri, Dec 20, 2019 at 5:37 PM:

> Hi everyone,
>
> I've recently been evaluating Flink for a real-time data warehouse, but
> one question is still unclear to me: what approach should we take when
> joining a detail (fact) table with a dimension table? My current idea is
> to consume the detail table as a stream and keep the dimension table in
> a cache. But this approach has a drawback: once the dimension table is
> updated, previously joined historical data can no longer be updated.
> Does anyone have other approaches? P.S. I've seen that Flink supports
> joins where both tables enter Flink as streams and historical data is
> kept in state; would that cause problems for very large tables?


Re: Re: using thin jar to replace fat jar on yarn cluster mode

2019-12-22, posted by zjfpla...@hotmail.com
When we first used YARN, we ran into a problem: the jars on YARN take
precedence over the jars on the specified classpath. The YARN cluster is
sometimes shared, so the jars in YARN's lib directory cannot be modified.
When Flink runs on a YARN cluster, do the jars on Flink's own classpath
take precedence over the jars on YARN?

zjfpla...@hotmail.com

From: tangjunli...@huitongjy.com
Date: 2019-12-23 10:34
To: user-zh
Subject: Re: using thin jar to replace fat jar on yarn cluster mode
Specify classpath



tangjunli...@huitongjy.com
From: zjfpla...@hotmail.com
Date: 2019-12-23 10:09
To: user; user-zh
Subject: using thin jar to replace fat jar on yarn cluster mode
Hi,
Does Flink on YARN support using a thin jar instead of a fat jar?
I don't want each Flink task's jar to be hundreds of MB. I'd like to put
all the dependency packages in a single directory, so that each task jar
is only tens of KB.

zjfpla...@hotmail.com


Re: using thin jar to replace fat jar on yarn cluster mode

2019-12-22, posted by tangjunli...@huitongjy.com
Specify classpath



tangjunli...@huitongjy.com
 
From: zjfpla...@hotmail.com
Date: 2019-12-23 10:09
To: user; user-zh
Subject: using thin jar to replace fat jar on yarn cluster mode
Hi,
Does Flink on YARN support using a thin jar instead of a fat jar?
I don't want each Flink task's jar to be hundreds of MB. I'd like to put
all the dependency packages in a single directory, so that each task jar
is only tens of KB.
 

zjfpla...@hotmail.com


using thin jar to replace fat jar on yarn cluster mode

2019-12-22, posted by zjfpla...@hotmail.com
Hi,
Does Flink on YARN support using a thin jar instead of a fat jar?
I don't want each Flink task's jar to be hundreds of MB. I'd like to put
all the dependency packages in a single directory, so that each task jar
is only tens of KB.


zjfpla...@hotmail.com


Re: [DISCUSS] What parts of the Python API should we focus on next ?

2019-12-22, posted by jincheng sun
Hi Bowen,

Your suggestions are very helpful for expanding the PyFlink ecosystem. I
also mentioned notebook integration above; Jupyter and Zeppelin are both
excellent notebooks. Integrating with them also requires support from
people in the Jupyter and Zeppelin communities. Jeff has already made
great efforts for PyFlink in the Zeppelin community. I would greatly
appreciate it if anyone active in the Jupyter community were also willing
to help integrate PyFlink.

Best,
Jincheng


Bowen Li wrote on Fri, Dec 20, 2019 at 12:55 AM:

> - integrate PyFlink with Jupyter notebook
>- Description: users should be able to run PyFlink seamlessly in Jupyter
>- Benefits: Jupyter is the industry-standard notebook for data
> scientists. I’ve talked to a few companies in North America; they think
> Jupyter is the #1 way to empower internal DS with Flink
>
>
> On Wed, Dec 18, 2019 at 19:05 jincheng sun 
> wrote:
>
>> Also CC user-zh.
>>
>> Best,
>> Jincheng
>>
>>
jincheng sun wrote on Thu, Dec 19, 2019 at 10:20 AM:
>>
>>> Hi folks,
>>>
>>> As release-1.10 is under feature freeze (the stateless Python UDF is
>>> already supported), it is time for us to plan the features of PyFlink for
>>> the next release.
>>>
>>> To make sure the features supported in PyFlink are the ones most in
>>> demand in the community, we'd like to get more people involved, i.e., it
>>> would be better if all of the devs and users joined the discussion of
>>> which kinds of features are more important and urgent.
>>>
>>> We have already listed some features from different aspects, which you
>>> can find below; however, it is not the final plan. We appreciate any
>>> suggestions from the community, either on functionality or on
>>> performance improvements, etc. It would be great to have the following
>>> information if you want to suggest adding a feature:
>>>
>>> -
>>> - Feature description: 
>>> - Benefits of the feature: 
>>> - Use cases (optional): 
>>> --
>>>
>>> Features in my mind
>>>
>>> 1. Integration with most popular Python libraries
>>> - fromPandas/toPandas API
>>>Description:
>>>   Support to convert between Table and pandas.DataFrame.
>>>Benefits:
>>>   Users could switch between Flink and Pandas API, for example,
>>> do some analysis using Flink and then perform analysis using the Pandas API
>>> if the result data is small and could fit into the memory, and vice versa.
>>>
>>> - Support Scalar Pandas UDF
>>>Description:
>>>   Support scalar Pandas UDF in Python Table API & SQL. Both the
>>> input and output of the UDF is pandas.Series.
>>>Benefits:
>>>   1) Scalar Pandas UDF performs better than row-at-a-time UDF,
>>> ranging from 3x to over 100x (from pyspark)
>>>   2) Users could use Pandas/Numpy API in the Python UDF
>>> implementation if the input/output data type is pandas.Series
>>>
>>> - Support Pandas UDAF in batch GroupBy aggregation
>>>Description:
>>>Support Pandas UDAF in batch GroupBy aggregation of Python
>>> Table API & SQL. Both the input and output of the UDF is pandas.DataFrame.
>>>Benefits:
>>>   1) Pandas UDAF performs better than row-at-a-time UDAF more
>>> than 10x in certain scenarios
>>>   2) Users could use Pandas/Numpy API in the Python UDAF
>>> implementation if the input/output data type is pandas.DataFrame
>>>
>>> 2. Fully support all kinds of Python UDFs
>>> - Support Python UDAF(stateful) in GroupBy aggregation (NOTE: Please
>>> give us some use case if you want this feature to be contained in the next
>>> release)
>>>   Description:
>>> Support UDAF in GroupBy aggregation.
>>>   Benefits:
>>> Users could define and use Python UDAF and use it in GroupBy
>>> aggregation. Without it, users have to use Java/Scala UDAF.
>>>
>>> - Support Python UDTF
>>>   Description:
>>>Support  Python UDTF in Python Table API & SQL
>>>   Benefits:
>>> Users could define and use Python UDTF in Python Table API &
>>> SQL. Without it, users have to use Java/Scala UDTF.
>>>
>>> 3. Debugging and Monitoring of Python UDF
>>>- Support User-Defined Metrics
>>>  Description:
>>>Allow users to define user-defined metrics and global job
>>> parameters with Python UDFs.
>>>  Benefits:
>>>UDF needs metrics to monitor some business or technical
>>> indicators, which is also a requirement for UDFs.
>>>
>>>- Make the log level configurable
>>>  Description:
>>>Allow users to config the log level of Python UDF.
>>>  Benefits:
>>>Users could configure different log levels when debugging and
>>> deploying.
>>>
>>> 4. Enrich the Python execution environment
>>>- Docker Mode Support
>>>  Description:
>>>  Support running python UDF in docker workers.
>>>  Benefits:
>>>  Support various of deployments to meet more users' requirements.
>>>
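The scalar Pandas UDF benefit described above can be sketched in plain pandas (no PyFlink API appears here, since the feature was only being proposed at the time of this thread): a row-at-a-time UDF is invoked once per element, while a Pandas UDF is invoked once per batch with pandas.Series as both input and output, so the arithmetic runs vectorized.

```python
import pandas as pd

# Row-at-a-time style UDF: invoked once per input row.
def plus_one(x: int) -> int:
    return x + 1

# Scalar "Pandas UDF" style: invoked once per batch; both the input
# and the output are a pandas.Series, so the addition runs vectorized.
def plus_one_vectorized(s: pd.Series) -> pd.Series:
    return s + 1

batch = pd.Series([1, 2, 3])
row_at_a_time = batch.map(plus_one)      # three Python-level calls
vectorized = plus_one_vectorized(batch)  # one call for the whole batch
assert row_at_a_time.tolist() == vectorized.tolist() == [2, 3, 4]
```

The speedups quoted in the thread (3x to over 100x, from PySpark) come from replacing the per-row Python call overhead with a single call that delegates to pandas/NumPy.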
>