[ 
https://issues.apache.org/jira/browse/FLINK-17438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17095076#comment-17095076
 ] 

Sivaprasanna Sethuraman commented on FLINK-17438:
-------------------------------------------------

[~why198852] I agree with [~lzljs3620320]. Can you check which charset your Hive 
is configured with? Java Strings fully support Unicode, so Chinese characters are 
not a problem by themselves; the bytes just have to be written and read as UTF-8.

Can you try setting {{SERDEPROPERTIES('serialization.encoding'='utf-8')}} in 
your Hive table DDL?
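
For reference, a minimal sketch of what that could look like for the {{demo_app}} table from the description below (the exact statements are only an illustration, assuming the table keeps {{LazySimpleSerDe}}):

{code:sql}
-- Illustration only: tell LazySimpleSerDe to decode the text files as UTF-8.
ALTER TABLE demo_app SET SERDEPROPERTIES ('serialization.encoding' = 'UTF-8');

-- Or declare it up front in the CREATE TABLE DDL:
-- ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
-- WITH SERDEPROPERTIES ('serialization.encoding' = 'UTF-8')
{code}

Note this only tells Hive how to decode the files; if the sink actually wrote the bytes with a non-UTF-8 JVM default charset on the TaskManagers, the value would have to match that charset instead.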

> Flink StreamingFileSink chinese garbled
> ---------------------------------------
>
>                 Key: FLINK-17438
>                 URL: https://issues.apache.org/jira/browse/FLINK-17438
>             Project: Flink
>          Issue Type: Bug
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>    Affects Versions: 1.10.0
>         Environment: CDH6.0.1 hadoop3.0.0 Flink 1.10.0 
>            Reporter: 颖
>            Priority: Blocker
>
> val writer: CompressWriterFactory[String] = new CompressWriterFactory[String](new DefaultExtractor[String]())
>   .withHadoopCompression(s"SnappyCodec") // ${compress}
> val fileConfig = OutputFileConfig.builder()
>   .withPartPrefix(s"${prefix}")
>   .withPartSuffix(s"${suffix}")
>   .build()
> val bulkFormatBuilder = StreamingFileSink.forBulkFormat(new Path(output), writer)
> // custom bucket assigner
> bulkFormatBuilder.withBucketAssigner(new DemoAssigner())
> // custom output file config
> bulkFormatBuilder.withOutputFileConfig(fileConfig)
> val sink = bulkFormatBuilder.build()
> // val rollingPolicy = DefaultRollingPolicy.builder()
> //   .withRolloverInterval(TimeUnit.MINUTES.toMillis(5))
> //   .withInactivityInterval(TimeUnit.MINUTES.toMillis(3))
> //   .withMaxPartSize(1 * 1024 * 1024)
> // val bulkFormatBuilder = StreamingFileSink.forRowFormat(new Path(output), new SimpleStringEncoder[String]())
> //   .withRollingPolicy(rollingPolicy.build())
> // val sink = bulkFormatBuilder.build()
> ds.map(_.log).addSink(sink).setParallelism(fileNum).name("snappy sink to hdfs")
>  
> The Flink API is called as above and the data is written to HDFS. The log 
> contains Chinese fields, and the corresponding text comes out garbled after 
> Hive parses it. The table DDL is:
> CREATE EXTERNAL TABLE `demo_app`(
>   `str` string COMMENT 'raw record JSON')
> COMMENT 'app Flink tracking log'
> PARTITIONED BY (
>   `ymd` string COMMENT 'date partition yyyymmdd')
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
>   'hdfs://nameservice1/user/xxx/inke_back.db'
> Kafka source data:
> {"name":"inke.dcc.flume.collect","type":"flume","status":"完成","batchDuration":3000,"proccessDelay":0,"shedulerDelay":0,"topic":"newserverlog_opd_operate_log","endpoint":"ali-a-opd-script01.bj","batchId":"xxx","batchTime":1588065997320,"numRecords":-1,"numBytes":-1,"totalRecords":0,"totalBytes":0,"ipAddr":"10.111.27.230"}
>  
> Hive data:
> {"name":"inke.dcc.flume.collect","type":"flume","status":"������","batchDuration":3000,"proccessDelay":0,"shedulerDelay":0,"topic":"newserverlog_opd_operate_log","endpoint":"ali-a-opd-script01.bj","batchId":"xxx","batchTime":1588065997320,"numRecords":-1,"numBytes":-1,"totalRecords":0,"totalBytes":0,"ipAddr":"10.111.27.230"}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
