Re: Splitting in Stream Formats for File Source

2023-08-21 Thread Chirag Dewan via user
 Thanks Ron.
For HDFS, a reasonable level of parallelism is reading multiple blocks in 
parallel. Of course, that could mean losing the ordering that a file usually 
guarantees. If I understand correctly, this may become a problem for 
watermarking. But since smaller files have bounded high watermarks, I feel 
this is a good tradeoff. 
So if every new line in my Avro-encoded Parquet file in HDFS is an Avro record, 
do you think a splittable StreamFormat is a possibility? 
Thanks 
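
To make the ordering concern above concrete, here is a minimal sketch in plain Java. It is not Flink's watermark implementation: the class LateRecordCounter and its watermark model (highest timestamp seen so far minus an allowed lag) are assumptions made only for illustration.

```java
// Hypothetical helper, not part of Flink.
class LateRecordCounter {

    // Counts records that arrive with a timestamp already below the watermark.
    // Simplified watermark model (an assumption): max timestamp seen so far
    // minus allowedLag.
    static int countLate(long[] timestampsInArrivalOrder, long allowedLag) {
        boolean seenAny = false;
        long maxSeen = 0L;
        int late = 0;
        for (long ts : timestampsInArrivalOrder) {
            if (seenAny && ts < maxSeen - allowedLag) {
                late++; // the watermark has already passed this record
            }
            maxSeen = seenAny ? Math.max(maxSeen, ts) : ts;
            seenAny = true;
        }
        return late;
    }
}
```

If a file holds timestamps 1..6 in order and two blocks (1,2,3 and 4,5,6) are read in parallel, the records may arrive as 1,4,2,5,3,6. With an allowed lag of 1 some records count as late; with an allowed lag that covers the span of a block, none do, which is the tradeoff described above.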
On Sunday, 20 August, 2023 at 01:11:42 pm IST, liu ron wrote:
 
 Hi,
Regarding the CSV and AvroParquet stream formats not supporting splits, I think 
some hints may be available from [1]. Personally, I think the main 
consideration is how a row format can find a reasonable split point, and how 
many splits are appropriate when slicing a file into more than one. For Orc 
and other columnar formats, a file can be further split by RowGroup, Page, 
etc. However, row formats carry no such information, so there may be no 
suitable basis for splitting.


[1] 
https://github.com/apache/flink/blob/9546f8243a24e7b45582b6de6702f819f1d73f97/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/reader/StreamFormat.java#L57
Best,
Ron
Chirag Dewan via user wrote on Thursday, 17 August 2023 at 12:00:

Hi,
I am trying to collect files from HDFS in my DataStream job. I need to 
collect two types of files - CSV and Parquet. 
I understand that Flink supports both formats, but in Streaming mode, Flink 
doesn't support splitting these formats. Splitting is only supported in the 
Table API.
I wanted to understand the thought process around this: why is splitting not 
supported in the CSV and AvroParquet stream formats? As far as my understanding 
goes, splitting would work fine with HDFS blocks and multiple blocks can be 
read in parallel. 
Maybe I am missing some fundamental aspect about this. 
Would like to understand more if someone can point me in the right 
direction.
Thanks

  

Re: Splitting in Stream Formats for File Source

2023-08-20 Thread liu ron
Hi,

Regarding the CSV and AvroParquet stream formats not supporting splits, I
think some hints may be available from [1]. Personally, I think the main
consideration is how a row format can find a reasonable split point, and
how many splits are appropriate when slicing a file into more than one.
For Orc and other columnar formats, a file can be further split by
RowGroup, Page, etc. However, row formats carry no such information, so
there may be no suitable basis for splitting.
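
As an illustration of what "finding a reasonable split point" involves for a row format, here is a minimal sketch in plain Java. It is not Flink code: LineSplitReader and readSplit are hypothetical names, and the sketch assumes every record ends with '\n' and that '\n' never occurs inside a record. CSV with quoted fields can break that assumption, since a quoted field may contain a line break.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flink.
class LineSplitReader {

    // Returns the records whose first byte lies in the byte range [start, end).
    // A record that begins before `end` is read to its end even if it crosses
    // `end`; the reader of the next range skips that partial record.
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start > 0) {
            // Begin one byte early so that a record starting exactly at `start`
            // is kept, then move to the first byte after the next newline.
            pos = start - 1;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++;
        }
        List<String> records = new ArrayList<>();
        while (pos < end && pos < data.length) {
            int recordEnd = pos;
            while (recordEnd < data.length && data[recordEnd] != '\n') {
                recordEnd++;
            }
            records.add(new String(data, pos, recordEnd - pos, StandardCharsets.UTF_8));
            pos = recordEnd + 1;
        }
        return records;
    }
}
```

With this rule, adjacent ranges cover every record exactly once, but only because a record boundary can be recognized from any byte offset. When the format offers no such marker, there is no safe place to start reading in the middle of a file.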


[1]
https://github.com/apache/flink/blob/9546f8243a24e7b45582b6de6702f819f1d73f97/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/reader/StreamFormat.java#L57

Best,
Ron

Chirag Dewan via user wrote on Thursday, 17 August 2023 at 12:00:

> Hi,
> I am trying to collect files from HDFS in my DataStream job. I need to
> collect two types of files - CSV and Parquet.
>
> I understand that Flink supports both formats, but in Streaming mode,
> Flink doesn't support splitting these formats. Splitting is only supported
> in the Table API.
>
> I wanted to understand the thought process around this: why is splitting
> not supported in the CSV and AvroParquet stream formats? As far as my
> understanding goes, splitting would work fine with HDFS blocks and multiple
> blocks can be read in parallel.
>
> Maybe I am missing some fundamental aspect about this.
>
> Would like to understand more if someone can point me in the right
> direction.
> Thanks
>
>


Splitting in Stream Formats for File Source

2023-08-16 Thread Chirag Dewan via user
Hi,
I am trying to collect files from HDFS in my DataStream job. I need to 
collect two types of files - CSV and Parquet. 
I understand that Flink supports both formats, but in Streaming mode, Flink 
doesn't support splitting these formats. Splitting is only supported in the 
Table API.
I wanted to understand the thought process around this: why is splitting not 
supported in the CSV and AvroParquet stream formats? As far as my understanding 
goes, splitting would work fine with HDFS blocks and multiple blocks can be 
read in parallel. 
Maybe I am missing some fundamental aspect about this. 
Would like to understand more if someone can point me in the right 
direction.
Thanks
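
A minimal sketch in plain Java of why block boundaries alone may not be enough. It is not Flink or HDFS code: BlockRanges and both of its methods are hypothetical, and the sketch assumes '\n'-terminated records. It cuts a file into fixed-size byte ranges, as a block-based split would, and checks whether a given range starts at the beginning of a record.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flink or HDFS.
class BlockRanges {

    // Cuts [0, fileLength) into consecutive {start, end} ranges of at most
    // blockSize bytes each. blockSize must be positive.
    static List<long[]> ranges(long fileLength, long blockSize) {
        List<long[]> out = new ArrayList<>();
        for (long start = 0; start < fileLength; start += blockSize) {
            out.add(new long[] {start, Math.min(start + blockSize, fileLength)});
        }
        return out;
    }

    // True if `offset` is the first byte of a record, assuming that every
    // record ends with '\n'.
    static boolean startsOnRecordBoundary(byte[] data, int offset) {
        return offset == 0 || data[offset - 1] == '\n';
    }
}
```

For the 22-byte file "id,name\n1,alice\n2,bob\n" and a block size of 10, the second range starts at byte 10, in the middle of the row "1,alice". A reader handed only that range needs some way to find the next record start, and the format has to make that possible.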