Flink Hive batch job fails with FileNotFoundException

2021-05-27, posted by bowen li
Hi everyone,
When I run a batch Table job that writes into Hive, I get a FileNotFound error: the .staging directory cannot be found.
   The version is 1.12.1 and the deployment is standalone.
   The error message is as follows:
   
Caused by: java.lang.Exception: Failed to finalize execution on master
  ... 33 more
Caused by: org.apache.flink.table.api.TableException: Exception in 
finalizeGlobal
  at 
org.apache.flink.table.filesystem.FileSystemOutputFormat.finalizeGlobal(FileSystemOutputFormat.java:94)
  at 
org.apache.flink.runtime.jobgraph.InputOutputFormatVertex.finalizeOnMaster(InputOutputFormatVertex.java:148)
  at 
org.apache.flink.runtime.executiongraph.ExecutionGraph.vertexFinished(ExecutionGraph.java:1368)
  ... 32 more
Caused by: java.io.FileNotFoundException: File 
hdfs://nameservice1/user/hive/warehouse/flinkdb.db/ods_csc_zcdrpf_test_utf8/.staging_1622171066985
 does not exist.
  at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:986)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:122)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1046)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1043)
  at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1053)
  at 
org.apache.flink.hive.shaded.fs.hdfs.HadoopFileSystem.listStatus(HadoopFileSystem.java:170)
  at 
org.apache.flink.table.filesystem.PartitionTempFileManager.headCheckpoints(PartitionTempFileManager.java:137)
  at 
org.apache.flink.table.filesystem.FileSystemCommitter.commitUpToCheckpoint(FileSystemCommitter.java:93)
  at 
org.apache.flink.table.filesystem.FileSystemOutputFormat.finalizeGlobal(FileSystemOutputFormat.java:92)
  ... 34 more
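
For context, the failing job is a batch INSERT into a Hive table. A minimal sketch of that kind of statement (the catalog name, source table and columns are hypothetical; only the target table comes from the path in the trace):

    -- Executed with a TableEnvironment in batch mode against a registered HiveCatalog.
    USE CATALOG myhive;
    -- While the job runs, output is staged under the target table's warehouse path in a
    -- temporary .staging_<timestamp> directory; when the job finishes, the master lists
    -- and commits that directory in FileSystemOutputFormat.finalizeGlobal, which is the
    -- step that fails with FileNotFoundException in the trace above.
    INSERT INTO flinkdb.ods_csc_zcdrpf_test_utf8
    SELECT id, name FROM some_source_table;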

Flink cluster job submission fails

2021-04-01, posted by bowen li
Hi everyone:
 The scenario we are running into is that an error is thrown when we submit a job. We are using version 1.12.1 with a standalone deployment. The error message is below.

   java.lang.OutOfMemoryError: Direct buffer memory. The direct out-of-memory 
error has occurred. This can mean two things: either job(s) require(s) a larger 
size of JVM direct memory or there is a direct memory leak. The direct memory 
can be allocated by user code or some of its dependencies. In this case 
'taskmanager.memory.task.off-heap.size' configuration option should be 
increased. Flink framework and its dependencies also consume the direct memory, 
mostly for network communication. The most of network memory is managed by 
Flink and should not result in out-of-memory error. In certain special cases, 
in particular for jobs with high parallelism, the framework may require more 
direct memory which is not managed by Flink. In this case 
'taskmanager.memory.framework.off-heap.size' configuration option should be 
increased. If the error persists then there is probably a direct memory leak in 
user code or some of its dependencies
  Do we need any special configuration for this case?
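
A minimal flink-conf.yaml sketch of the two options named in the error message; the sizes are only illustrative and have to be tuned against the TaskManager's actual memory budget:

    # direct memory available to user code and connector dependencies
    taskmanager.memory.task.off-heap.size: 512m
    # direct memory for the framework itself, e.g. for high-parallelism network stacks
    taskmanager.memory.framework.off-heap.size: 256m

If the error persists after raising these, the message itself points to a possible direct memory leak in user code or one of its dependencies.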

Re: Source parallelism problem when reading Hive with Flink 1.10.0

2020-03-05, posted by Bowen Li
@JingsongLee  Let's add the current hive sink parallelism configuration strategy to the documentation:
https://issues.apache.org/jira/browse/FLINK-16448

On Tue, Mar 3, 2020 at 9:31 PM Jun Zhang <825875...@qq.com> wrote:

>
> Right. Actually I think the example SQL I wrote here is a very widely used kind of query: after creating a new hive table and importing data, people usually run a similar query to check that the table was created correctly and that the data is right.
>
>
>
>
>
>
> On 2020-03-04 13:25, JingsongLee wrote:
>   Hi jun,
>
>
> Jira: https://issues.apache.org/jira/browse/FLINK-16413
> FYI
>
>
> Best,
> Jingsong Lee
>
>
> --
> From: JingsongLee  Send Time: 2020-03-03 (Tuesday) 19:06
> To: Jun Zhang <825875...@qq.com>; user-zh@flink.apache.org
> Cc: user-zh@flink.apache.org; likeg...@163.com
> Subject: Re: Source parallelism problem when reading Hive with Flink 1.10.0
>
>
> Hi jun,
>
> Good suggestion~ This is a nice optimization point~ Feel free to open a JIRA.
>
> Best,
> Jingsong Lee
>
>
> --
> From: Jun Zhang <825875...@qq.com>
> Send Time: 2020-03-03 (Tuesday) 18:45
> To: user-zh@flink.apache.org; lzljs3620...@aliyun.com
> Cc: user-zh@flink.apache.org; likeg...@163.com
> Subject: Re: Source parallelism problem when reading Hive with Flink 1.10.0
>
>
> 
> hi, Jingsong:
> I want to raise one issue. I enabled automatic inference; say I set the maximum inferred parallelism to 10,
> and I have a query like select * from mytable limit 1;
> The hive table mytable has more than 10 files. Isn't starting 10 parallel instances a bit wasteful in that case?
> On 2020-03-02 16:38, JingsongLee wrote: I recommend using Batch mode to read the Hive table.
>
> Best,
> Jingsong Lee
>
>
> --
> From: like  Send Time: 2020-03-02 (Monday) 16:35
> To: lzljs3620...@aliyun.com  Subject: Re: Source parallelism problem when reading Hive with Flink 1.10.0
>
>
> I am using StreamTableEnvironment, and I did run into this problem.
> On 2020-03-02 16:16, JingsongLee wrote (replying to "automatic inference may face the problem of insufficient resources to start"):
>
> In theory that shouldn't happen, right? Batch jobs can run with only part of their tasks at a time.
>
> Best,
> Jingsong Lee
>
> --
> From: like  Send Time: 2020-03-02 (Monday) 15:35
> To: user-zh@flink.apache.org; lzljs3620...@aliyun.com  Subject: Re: Source parallelism problem when reading Hive with Flink 1.10.0
>
>
> Thank you very much! After turning off automatic inference I can now control the source parallelism. Automatic inference may face the problem of not having enough resources to start.
>
>
> On 2020-03-02 15:18, JingsongLee wrote: Hi,
>
> In 1.10 the Hive source infers its parallelism automatically. You can control the parallelism with the following options in flink-conf.yaml:
> - table.exec.hive.infer-source-parallelism=true (automatic inference is used by default)
> - table.exec.hive.infer-source-parallelism.max=1000 (maximum parallelism for automatic inference)
>
> The sink parallelism defaults to the parallelism of its upstream; if there is a shuffle, the configured global parallelism is used.
>
> Best,
> Jingsong Lee
>
>
> --
> From: like  Send Time: 2020-03-02 (Monday) 14:58
> To: user-zh@flink.apache.org  Subject: Source parallelism problem when reading Hive with Flink 1.10.0
>
> hi everyone,
>
> I use flink 1.10.0 to read a hive table and set a global parallelism at startup. When reading, I find that the sink parallelism is constrained by this setting,
>
> but the source parallelism is not; it changes with the size of the source, and for large sources it goes up to 1000. How should I handle this parallelism?
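
For reference, a small flink-conf.yaml sketch of the options Jingsong mentions above (the values are only examples):

    # disable automatic inference so the Hive source uses the job's configured parallelism
    table.exec.hive.infer-source-parallelism: false
    # or keep inference enabled but cap the inferred parallelism
    table.exec.hive.infer-source-parallelism.max: 100

With inference disabled, the source falls back to the globally configured parallelism, which is what resolved the issue in this thread.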


Re: Flink connect hive with hadoop HA

2020-02-10, posted by Bowen Li
Hi sunfulin,

Sounds like you didn't configure the hadoop HA correctly on the client side;
see [1]. Let us know if it helps resolve the issue.

[1]
https://stackoverflow.com/questions/25062788/namenode-ha-unknownhostexception-nameservice1




On Mon, Feb 10, 2020 at 7:11 AM Khachatryan Roman <
khachatryan.ro...@gmail.com> wrote:

> Hi,
>
> Could you please provide a full stacktrace?
>
> Regards,
> Roman
>
>
> On Mon, Feb 10, 2020 at 2:12 PM sunfulin  wrote:
>
>> Hi, guys
>> I am using Flink 1.10 and testing functional cases with hive integration.
>> Hive is 1.1.0-cdh5.3.0 with hadoop HA enabled. Running the flink job I can
>> see a successful connection with the hive metastore, but cannot read table data
>> with exception:
>>
>> java.lang.IllegalArgumentException: java.net.UnknownHostException:
>> nameservice1
>> at
>> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
>> at
>> org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
>> at
>> org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
>> at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:668)
>> at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:604)
>> at
>> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
>> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2598)
>> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>>
>> I am running a standalone application. Looks like I am missing my hadoop
>> conf files in my flink job application classpath. Where should I configure this?
>>
>>
>>
>>
>

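For reference, the client-side setting that usually fixes this: Flink needs to see the HDFS client configuration that defines the logical name nameservice1, i.e. the directory containing hdfs-site.xml and core-site.xml. A sketch, with illustrative paths:

    # environment variable picked up by the Flink client and cluster processes
    export HADOOP_CONF_DIR=/etc/hadoop/conf

    # or, equivalently, in flink-conf.yaml
    env.hadoop.conf.dir: /etc/hadoop/conf

Without it, the HDFS client treats nameservice1 as a plain hostname, which is exactly the UnknownHostException in the stack trace above.
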

Re: [ANNOUNCE] Dian Fu becomes a Flink committer

2020-01-16, posted by Bowen Li
Congrats!

On Thu, Jan 16, 2020 at 13:45 Peter Huang 
wrote:

> Congratulations, Dian!
>
>
> Best Regards
> Peter Huang
>
> On Thu, Jan 16, 2020 at 11:04 AM Yun Tang  wrote:
>
>> Congratulations, Dian!
>>
>> Best
>> Yun Tang
>> --
>> *From:* Benchao Li 
>> *Sent:* Thursday, January 16, 2020 22:27
>> *To:* Congxian Qiu 
>> *Cc:* d...@flink.apache.org ; Jingsong Li <
>> jingsongl...@gmail.com>; jincheng sun ; Shuo
>> Cheng ; Xingbo Huang ; Wei Zhong
>> ; Hequn Cheng ; Leonard Xu
>> ; Jeff Zhang ; user <
>> u...@flink.apache.org>; user-zh 
>> *Subject:* Re: [ANNOUNCE] Dian Fu becomes a Flink committer
>>
>> Congratulations Dian.
>>
>> Congxian Qiu  于2020年1月16日周四 下午10:15写道:
>>
>> > Congratulations Dian Fu
>> >
>> > Best,
>> > Congxian
>> >
>> >
>> > Jark Wu  wrote on Thu, Jan 16, 2020 at 7:44 PM:
>> >
>> >> Congratulations Dian and welcome on board!
>> >>
>> >> Best,
>> >> Jark
>> >>
>> >> On Thu, 16 Jan 2020 at 19:32, Jingsong Li 
>> wrote:
>> >>
>> >> > Congratulations Dian Fu. Well deserved!
>> >> >
>> >> > Best,
>> >> > Jingsong Lee
>> >> >
>> >> > On Thu, Jan 16, 2020 at 6:26 PM jincheng sun <
>> sunjincheng...@gmail.com>
>> >> > wrote:
>> >> >
>> >> >> Congrats Dian Fu and welcome on board!
>> >> >>
>> >> >> Best,
>> >> >> Jincheng
>> >> >>
>> >> >> Shuo Cheng  wrote on Thu, Jan 16, 2020 at 6:22 PM:
>> >> >>
>> >> >>> Congratulations!  Dian Fu
>> >> >>>
>> >> >>> > Xingbo Wei Zhong  wrote on Thu, Jan 16, 2020 at 6:13 PM:  jincheng sun
>> >> >>> wrote on Thu, Jan 16, 2020 at 5:58 PM:
>> >> >>>
>> >> >>
>> >> >
>> >> > --
>> >> > Best, Jingsong Lee
>> >> >
>> >>
>> >
>>
>> --
>>
>> Benchao Li
>> School of Electronics Engineering and Computer Science, Peking University
>> Tel:+86-15650713730
>> Email: libenc...@gmail.com; libenc...@pku.edu.cn
>>
>


Re: Registering a confluent schema registry avro topic as a table in flink sql

2020-01-07, posted by Bowen Li
Hi 陈帅,

This is a very reasonable requirement. We need to develop a Flink ConfluentSchemaRegistryCatalog
to fetch the metadata. The user experience the community is aiming for is that users only provide the confluent schema registry URL, and Flink SQL
automatically obtains all the information needed for reading and writing through the ConfluentSchemaRegistryCatalog, so users no longer have to write the DDL and format by hand.

Discussion has already started inside the community; we expect to finish this in 1.11. Please follow
https://issues.apache.org/jira/browse/FLINK-12256


On Wed, Dec 18, 2019 at 6:46 AM 陈帅  wrote:

> Thanks for the reply. Given the schema registry url, why do we still need to fill in the subject and avroSchemaStr?
>
> 朱广彬  wrote on Wed, Dec 18, 2019 at 10:30 AM:
>
> > Hi 陈帅,
> >
> > The community indeed does not support the confluent schema registry avro format yet. Internally we also rely on the schema registry to manage avro
> > schemas, so we modified the flink-avro source code to support it.
> >
> > The main places involved are:
> >
> >
> org.apache.flink.formats.avro.{AvroRowFormatFactory,AvroRowDeserializationSchema,AvroRowSerializationSchema}
> > and org.apache.flink.table.descriptors.{Avro,AvroValidator}
> >
> > When using it, just specify the following three parameters when building the Avro descriptor (the parts marked in red in the original mail, i.e. the registry-related calls below):
> >
> > tableEnv.connect(
> >     new Kafka()
> >         .version("universal")
> >         .topic(topic)
> >         .properties(props)
> > ).withFormat(
> >     new Avro()
> >         .useRegistry(true)
> >         .registryUrl(KAFKA_SCHEMA_REGISTRY_URL_ADDRESS)
> >         .registrySubject(subject)
> >         .avroSchema(avroSchemaStr)
> > )
> >
> >
> 陈帅  wrote on Wed, Dec 18, 2019 at 8:26 AM:
> > >
> > > Can flink sql support registering a topic whose avro data format is registered in the confluent schema registry
> as a table?
> >
>


Re: [DISCUSS] What parts of the Python API should we focus on next ?

2019-12-19, posted by Bowen Li
- integrate PyFlink with Jupyter notebook
   - Description: users should be able to run PyFlink seamlessly in Jupyter
   - Benefits: Jupyter is the industrial standard notebook for data
scientists. I’ve talked to a few companies in North America, they think
Jupyter is the #1 way to empower internal DS with Flink


On Wed, Dec 18, 2019 at 19:05 jincheng sun  wrote:

> Also CC user-zh.
>
> Best,
> Jincheng
>
>
> jincheng sun  于2019年12月19日周四 上午10:20写道:
>
>> Hi folks,
>>
>> As release-1.10 is under feature-freeze(The stateless Python UDF is
>> already supported), it is time for us to plan the features of PyFlink for
>> the next release.
>>
>> To make sure the features supported in PyFlink are the mostly demanded
>> for the community, we'd like to get more people involved, i.e., it would be
>> better if all of the devs and users join in the discussion of which kind of
>> features are more important and urgent.
>>
>> We have already listed some features from different aspects which you can
>> find below, however it is not the ultimate plan. We appreciate any
>> suggestions from the community, either on the functionalities or
>> performance improvements, etc. Would be great to have the following
>> information if you want to suggest to add some features:
>>
>> -
>> - Feature description: 
>> - Benefits of the feature: 
>> - Use cases (optional): 
>> --
>>
>> Features in my mind
>>
>> 1. Integration with most popular Python libraries
>> - fromPandas/toPandas API
>>Description:
>>   Support to convert between Table and pandas.DataFrame.
>>Benefits:
>>   Users could switch between Flink and Pandas API, for example,
>> do some analysis using Flink and then perform analysis using the Pandas API
>> if the result data is small and could fit into the memory, and vice versa.
>>
>> - Support Scalar Pandas UDF
>>Description:
>>   Support scalar Pandas UDF in Python Table API & SQL. Both the
>> input and output of the UDF is pandas.Series.
>>Benefits:
>>   1) Scalar Pandas UDF performs better than row-at-a-time UDF,
>> ranging from 3x to over 100x (from pyspark)
>>   2) Users could use Pandas/Numpy API in the Python UDF
>> implementation if the input/output data type is pandas.Series
>>
>> - Support Pandas UDAF in batch GroupBy aggregation
>>Description:
>>Support Pandas UDAF in batch GroupBy aggregation of Python
>> Table API & SQL. Both the input and output of the UDF is pandas.DataFrame.
>>Benefits:
>>   1) Pandas UDAF performs better than row-at-a-time UDAF more
>> than 10x in certain scenarios
>>   2) Users could use Pandas/Numpy API in the Python UDAF
>> implementation if the input/output data type is pandas.DataFrame
>>
>> 2. Fully support  all kinds of Python UDF
>> - Support Python UDAF(stateful) in GroupBy aggregation (NOTE: Please
>> give us some use case if you want this feature to be contained in the next
>> release)
>>   Description:
>> Support UDAF in GroupBy aggregation.
>>   Benefits:
>> Users could define and use Python UDAF and use it in GroupBy
>> aggregation. Without it, users have to use Java/Scala UDAF.
>>
>> - Support Python UDTF
>>   Description:
>>Support  Python UDTF in Python Table API & SQL
>>   Benefits:
>> Users could define and use Python UDTF in Python Table API & SQL.
>> Without it, users have to use Java/Scala UDTF.
>>
>> 3. Debugging and Monitoring of Python UDF
>>- Support User-Defined Metrics
>>  Description:
>>Allow users to define user-defined metrics and global job
>> parameters with Python UDFs.
>>  Benefits:
>>UDF needs metrics to monitor some business or technical
>> indicators, which is also a requirement for UDFs.
>>
>>- Make the log level configurable
>>  Description:
>>Allow users to config the log level of Python UDF.
>>  Benefits:
>>Users could configure different log levels when debugging and
>> deploying.
>>
>> 4. Enrich the Python execution environment
>>- Docker Mode Support
>>  Description:
>>  Support running python UDF in docker workers.
>>  Benefits:
>>  Support various of deployments to meet more users' requirements.
>>
>> 5. Expand the usage scope of Python UDF
>>- Support to use Python UDF via SQL client
>>  Description:
>>  Support to register and use Python UDF via SQL client
>>  Benefits:
>>  SQL client is a very important interface for SQL users. This
>> feature allows SQL users to use Python UDFs via SQL client.
>>
>>- Integrate Python UDF with Notebooks
>>  Description:
>>  Such as Zeppelin, etc (Especially Python dependencies)
>>
>>- Support to register Python UDF into catalog
>>   Description:
>>   Support to register Python UDF into catalog
>>   Benefits:
>>   1)Catalog is the centralized 

Re: Flink and Hive integration question

2019-05-14, posted by Bowen Li
Hi,

We are working on platform-level metadata and data integration between Flink and Hive. You can follow the flink-connector-hive

module,
Hive metadata in FLINK-11479
and Hive data in FLINK-10729;
the first version is planned for the 1.9.0 release.

Feel free to join the official Flink-Hive DingTalk group.

[image: image.png]


On Mon, May 13, 2019 at 7:14 PM Yaoting Gong 
wrote:

> Hi everyone,
>  I am new to big data. I previously got familiar with the Flink Stream API and used it to process kafka, es, and hbase.
>  But now, while looking into Flink
> SQL, I have run into a problem: we need to support joins across multiple data sources, especially hive. We hope to build a small platform where a new job only needs some added configuration.
> Our Flink version is 1.7.1. We hit problems when interacting with hive. If I connect to hive via jdbc, the performance is definitely not enough. If I read the
> underlying hdfs files directly, it seems I need a batch environment, which conflicts with the stream I need to join against.
>
>    I hope you can give some suggestions and ideas, and point me to related projects if any exist. So far I have found the FlinkStreamSQL project, which supports multiple sources, but not hive.
>
> Thanks again
>

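For reference: once the flink-connector-hive module mentioned above shipped (Flink 1.9+), the Hive metastore could be registered as a catalog from the SQL client configuration, roughly as in this sketch (catalog name, paths and Hive version are illustrative):

    # sql-client-defaults.yaml
    catalogs:
      - name: myhive
        type: hive
        hive-conf-dir: /opt/hive-conf
        hive-version: 1.2.1

after which Hive tables can be queried and joined directly from Flink SQL without going through JDBC.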

Re: Re: [Progress Update] [Discussion] Flink compatibility with Hive, and Catalogs

2019-03-29, posted by Bowen Li
Thanks for all the replies! As a next step I will organize your feedback and pass it on to our team.

You are also welcome to join the official Flink-Hive user DingTalk group to discuss and give feedback.
[image: image.png]

On Wed, Mar 20, 2019 at 8:39 AM ChangTong He  wrote:

> >- *Which Hive version are you using? Do you plan to upgrade Hive?*
>
> The two batch systems I currently maintain are CDH5.10.0
>
> and CDH5.13.1, both on hive-1.1.0. At the end of last year we set up a CDH6 cluster for the developers to test, but we currently have roughly 5000+ scheduled jobs; if we upgrade the clusters to 6 when we do our IDC migration this year, we would probably move to the corresponding hive-2.1.1.
>
> >- *Are you planning to switch the Hive engine? Is there a timeline? What capabilities would Flink need before you would consider using Flink to read/write Hive?*
>
>
> No plan. Our platform has sentry enabled, so I don't know how well Flink works with sentry. For batch, most jobs are concentrated between 3 and 5 AM, which is also the time window where problems are most likely. If Flink can offer better failover and better control over resources, we would consider it.
>
> >- *What is your motivation for trying Flink-Hive? Maintaining only one data processing system? Getting better performance with Flink?*
>
>
> My original motivation for looking into Flink-Hive was indeed to maintain only one data processing system. I currently maintain the big data platforms of two departments; each has its own batch stack and its own real-time stack, and hive also writes to yet another hbase cluster via phoenix. It's a headache.
>
> >- *How do you use Hive? How large is the data? Mainly reads, or both reads and writes?*
>
> Mostly MR2; there are fewer hive on spark jobs but they are unstable. Data volume is at the TB level, with both reads and writes.
>
> >- *How many Hive UDFs are there? What types are they?*
>
> There are 80+ UDFs; judging from their names, most of them are business-logic checks.
>
> >- *Any questions or suggestions for the project?*
>
> Mainly stability, plus compatibility with older hive versions (when I did spark-sql
> cli testing for the company, it was very clear that developers are reluctant to change existing code; they want to switch to a new engine smoothly without touching existing code).
> >
>
> 王志明 wrote on Wed, Mar 20, 2019 at 8:47 PM:
>
> > Hi,
> >  "Integrating Flink with Hive" is indeed a big and valuable topic. On the points below, here is my input based on my own work, just to get the discussion going:
> > - *Which Hive version are you using? Do you plan to upgrade Hive?*
> > We currently use Apache Hive 1.2 and have no plan to upgrade.
> >
> > - *Are you planning to switch the Hive engine? Is there a timeline? What capabilities would Flink need before you would consider using Flink to read/write Hive?*
> > One factor is that we run a large volume of jobs at night; if Flink reads/writes Hive fast and can handle large data volumes, we would consider it.
> >
> > - *What is your motivation for trying Flink-Hive? Maintaining only one data processing system? Getting better performance with Flink?*
> > We hope the core code of stream and batch processing can be one codebase, which helps development, maintenance, and data correctness.
> >
> > - *How do you use Hive? How large is the data? Mainly reads, or both reads and writes?*
> > We hope to use Flink on Hive; data volume is at the TB level, with both reads and writes.
> >
> >
> >
> >
> >
> >
> >
> >
> > On 2019-03-20 09:28:55, "董鹏" wrote:
> > >1. First of all, I am very excited that flink has come under Alibaba's wing. On the questions below, based on some of my experience, my humble input:
> > >hive is extremely important: stable, runs the night-time jobs, meets our needs.
> > >   - *Which Hive version are you using? Do you plan to upgrade Hive?* // the cdh5 version, no plan to upgrade
> > >   -
> >
> > *Are you planning to switch the Hive engine? Is there a timeline? What capabilities would Flink need before you would consider using Flink to read/write Hive?* // We tried the spark engine for night-time jobs and it was unstable. We are not especially chasing performance; once it is stable we will try flink
> > on hive
> > >   -
> >
> > *What is your motivation for trying Flink-Hive? Maintaining only one data processing system? Getting better performance with Flink?* // Technology iteration. The ideal situation is of course unified batch and streaming, maintaining only one data processing system. spark's performance is already great, so chasing better performance is not something we need.
> > >   - *How do you use Hive? How large is the data? Mainly reads, or both reads and writes?* // Large tables with sizable data volumes, mainly reads
> > >   - *How many Hive UDFs are there? What types are they?* // Quite a lot
> > >   - *Any questions or suggestions for the project?* // 1) For flink on hive
> > in near-real-time scenarios, the higher the performance the better, with relatively small data volumes. 2) For offline scenarios, stability first, then performance. 3) Community activity, and means of troubleshooting problems.
> > >
> > >
> > >-- Original --
> > >From:  "Bowen Li";
> > >Date:  Wed, Mar 20, 2019 08:09 AM
> > >To:  "user-zh";
> > >
> > >Subject:  [Progress Update] [Discussion] Flink compatibility with Hive, and Catalogs
> > >
> > >
> > >Hello everyone on the Flink Chinese channel,
> > >
> > >*We would like to collect your requirements and opinions on Flink's compatibility with Hive.*
> > >
> > >Background: At Flink Forward China last December, the community announced the effort to make Flink compatible with Hive. On Feb 21 this year, at the Seattle Flink Meetup,
> > >we gave the talk “Integrating Flink with Hive”
> > >with a live demo and got a great response. Now, in mid March, we have internally finished building Flink's brand-new catalog architecture, compatibility with Hive
> > >metadata, and the common cases of reading and writing
> > >Hive data through Flink. We have started submitting the related PRs and design docs to contribute the developed features back to the community. Everyone is welcome to join all parts of the project, such as reviewing design docs and PRs, and participating in development and testing.
> > >
> > >*The most important thing right now is that we hope community members will share how they each use Hive, and give our project feedback and suggestions.*
> > >
> > >We have started going deeper into specific areas of making Flink compatible with Hive. Your feedback and suggestions help us better assess the priority of each piece of work, so that our users can get the features you need sooner. For example, if the vast majority of users mainly read Hive data, we will prioritize optimizing the read path.
> > >
> > >A quick review of the work we have already finished internally:
> > >
> > >   - Flink/Hive metadata compatibility
> > >  - A unified, pluggable catalog architecture for managing metadata such as catalogs, databases, tables, views, functions,
> > >  partitions, and table/partition stats
> > >  - Three catalog implementations: a default in-memory catalog; HiveCatalog
> > >  for embracing Hive ecosystem metadata; GenericHiveMetastoreCatalog for persisting Flink
> > >  streaming and batch metadata in the Hive metastore
> > >  - Hierarchical references of the form catalog.database.<metadata object name> in SQL and the Table API
> > >  - A unified function catalog, with support for simple Hive UDFs
> > >   - Flink/Hive data compatibility
> > >  - Hive connector support: reading partitioned and non-partitioned tables, partition pruning, Hive simple and complex data types, and basic writes
> > >   - An SQL client that integrates the features above
> > >
> > >*What we want to learn is: How do you use Hive today? How can we help solve your problems? What features do you expect from Flink's Hive compatibility? For example:*
> > >
> > >   - *Which Hive version are you using? Do you plan to upgrade Hive?*
> > >   - *Are you planning to switch the Hive engine? Is there a timeline? What capabilities would Flink need before you would consider using Flink to read/write Hive?*
> > >   - *What is your motivation for trying Flink-Hive? Maintaining only one data processing system? Getting better performance with Flink?*
> > >   - *How do you use Hive? How large is the data? Mainly reads, or both reads and writes?*
> > >   - *How many Hive UDFs are there? What types are they?*
> > >   - *Any questions or suggestions for the project?*
> > >
> > >Your suggestions matter a lot to us. We hope this work can truly benefit community users as soon as possible. We will try to put together a survey this week to collect your feedback and suggestions more comprehensively.
> > >
> > >Bowen
> >
>


Re: [PROGRESS UPDATE] [DISCUSS] Flink-Hive Integration and Catalogs

2019-03-20, posted by Bowen Li
Thanks, Shaoxuan! I've sent a Chinese version to user-zh at the same time
yesterday.

From the feedback we have received so far, supporting multiple older hive versions
is definitely one of our next focuses.

*More feedback is welcome from our community!*


On Tue, Mar 19, 2019 at 8:44 PM Shaoxuan Wang  wrote:

> Hi Bowen,
> Thanks for driving this. I am CCing this email/survey to user-zh@
> flink.apache.org as well.
> I heard there is a lot of interest in Flink-Hive from the field. One of
> the biggest requests the hive users have raised is "support for
> out-of-date hive versions". A large number of users are still working on
> clusters with CDH/HDP installed with an old hive version, say 1.2.1/2.1.1. We
> need to ensure support for these Hive versions when planning the work on
> Flink-Hive integration.
>
> *@all. "We want to get your feedbacks on Flink-Hive integration." *
>
> Regards,
> Shaoxuan
>
> On Wed, Mar 20, 2019 at 7:16 AM Bowen Li  wrote:
>
>> Hi Flink users and devs,
>>
>> We want to get your feedbacks on integrating Flink with Hive.
>>
>> Background: In Flink Forward in Beijing last December, the community
>> announced to initiate efforts on integrating Flink and Hive. On Feb 21 
>> Seattle
>> Flink Meetup <https://www.meetup.com/seattle-flink/events/258723322/>,
>> We presented Integrating Flink with Hive
>> <https://www.slideshare.net/BowenLi9/integrating-flink-with-hive-xuefu-zhang-and-bowen-li-seattle-flink-meetup-feb-2019>
>>  with
>> a live demo to local community and got great response. As of mid March now,
>> we have internally finished building Flink's brand-new catalog
>> infrastructure, metadata integration with Hive, and most common cases of
>> Flink reading/writing against Hive, and will start to submit more design
>> docs/FLIP and contribute code back to community. The reason for doing it
>> internally first and then in community is to ensure our proposed solutions
>> are fully validated and tested, gain hands-on experience and not miss
>> anything in design. You are very welcome to join this effort, from
>> design/code review, to development and testing.
>>
>> *The most important thing we believe you, our Flink users/devs, can help
>> RIGHT NOW is to share your Hive use cases and give us feedbacks for this
>> project. As we start to go deeper on specific areas of integration, you
>> feedbacks and suggestions will help us to refine our backlogs and
>> prioritize our work, and you can get the features you want sooner! *Just
>> for example, if most users is mainly only reading Hive data, then we can
>> prioritize tuning read performance over implementing write capability.
>> A quick review of what we've finished building internally and is ready to
>> contribute back to community:
>>
>>- Flink/Hive Metadata Integration
>>   - Unified, pluggable catalog infra that manages meta-objects,
>>   including catalogs, databases, tables, views, functions, partitions,
>>   table/partition stats
>>   - Three catalog impls - A in-memory catalog, HiveCatalog for
>>   embracing Hive ecosystem, GenericHiveMetastoreCatalog for persisting
>>   Flink's streaming/batch metadata in Hive metastore
>>   - Hierarchical metadata reference as
>>   catalog.database.object in SQL and Table API
>>   - Unified function catalog based on new catalog infra, also
>>   support Hive simple UDF
>>- Flink/Hive Data Integration
>>   - Hive data connector that reads partitioned/non-partitioned Hive
>>   tables, and supports partition pruning, both Hive simple and complex 
>> data
>>   types, and basic write
>>- More powerful SQL Client fully integrated with the above features
>>and more Hive-compatible SQL syntax for better end-to-end SQL experience
>>
>> *Given above info, we want to learn from you on: How do you use Hive
>> currently? How can we solve your pain points? What features do you expect
>> from Flink-Hive integration? Those can be details like:*
>>
>>- *Which Hive version are you using? Do you plan to upgrade Hive?*
>>- *Are you planning to switch Hive engine? What timeline are you
>>looking at? Until what capabilities Flink has will you consider using 
>> Flink
>>with Hive?*
>>- *What's your motivation to try Flink-Hive? Maintain only one data
>>processing system across your teams for simplicity and maintainability?
>>Better performance of Flink over Hive itself?*
>>- *What are your Hive use cases? How large is your Hive data size? Do
>>you mainly do reading, or 

[Progress Update] [Discussion] Flink compatibility with Hive, and Catalogs

2019-03-19, posted by Bowen Li
Hello everyone on the Flink Chinese channel,

*We would like to collect your requirements and opinions on Flink's compatibility with Hive.*

Background: At Flink Forward China last December, the community announced the effort to make Flink compatible with Hive. On Feb 21 this year, at the Seattle Flink Meetup,
we gave the talk “Integrating Flink with Hive”
with a live demo and got a great response. Now, in mid March, we have internally finished building Flink's brand-new catalog architecture, compatibility with Hive
metadata, and the common cases of reading and writing
Hive data through Flink. We have started submitting the related PRs and design docs to contribute the developed features back to the community. Everyone is welcome to join all parts of the project, such as reviewing design docs and PRs, and participating in development and testing.

*The most important thing right now is that we hope community members will share how they each use Hive, and give our project feedback and suggestions.*
We have started going deeper into specific areas of making Flink compatible with Hive. Your feedback and suggestions help us better assess the priority of each piece of work, so that our users can get the features you need sooner. For example, if the vast majority of users mainly read Hive data, we will prioritize optimizing the read path.

A quick review of the work we have already finished internally:

   - Flink/Hive metadata compatibility
  - A unified, pluggable catalog architecture for managing metadata such as catalogs, databases, tables, views, functions,
  partitions, and table/partition stats
  - Three catalog implementations: a default in-memory catalog; HiveCatalog
  for embracing Hive ecosystem metadata; GenericHiveMetastoreCatalog for persisting Flink
  streaming and batch metadata in the Hive metastore
  - Hierarchical references of the form catalog.database.<metadata object name> in SQL and the Table API
  - A unified function catalog, with support for simple Hive UDFs
   - Flink/Hive data compatibility
  - Hive connector support: reading partitioned and non-partitioned tables, partition pruning, Hive simple and complex data types, and basic writes
   - An SQL client that integrates the features above

*What we want to learn is: How do you use Hive today? How can we help solve your problems? What features do you expect from Flink's Hive compatibility? For example:*

   - *Which Hive version are you using? Do you plan to upgrade Hive?*
   - *Are you planning to switch the Hive engine? Is there a timeline? What capabilities would Flink need before you would consider using Flink to read/write Hive?*
   - *What is your motivation for trying Flink-Hive? Maintaining only one data processing system? Getting better performance with Flink?*
   - *How do you use Hive? How large is the data? Mainly reads, or both reads and writes?*
   - *How many Hive UDFs are there? What types are they?*
   - *Any questions or suggestions for the project?*

Your suggestions matter a lot to us. We hope this work can truly benefit community users as soon as possible. We will try to put together a survey this week to collect your feedback and suggestions more comprehensively.

Bowen