Hi,
I am hoping to get some hints/pointers from the experts here. I hope the scenario described below is understandable and is a valid use case. Please let me know if I need to explain the scenario better.

Regards,
Akanksha

________________________________
From: Akanksha Sharma B
Sent: Friday, July 27, 2018 9:44 AM
To: [email protected]
Subject: Re: pipeline with parquet and sql

________________________________
From: Akanksha Sharma B
Sent: Tuesday, July 24, 2018 4:47:25 PM
To: [email protected]
Subject: pipeline with parquet and sql

Hi,

Please consider the following pipeline:

The source is a Parquet file with hundreds of columns. The sink is Parquet. Multiple output Parquet files are generated after applying some SQL joins, and the joins to be applied differ for each output file. Let's assume we have a SQL query generator or some configuration file with the needed info.

Can this be implemented generically, so that there is no need for the schema of the Parquet files involved, or for any intermediate POJO or Beam schema? That is, the way Spark can handle it: read Parquet into a DataFrame, create a temp view, apply SQL queries to it, and write the result back to Parquet.

As I understand it, Beam SQL needs Beam schemas or POJOs, and ParquetIO needs Avro schemas. Ideally, we don't want to see POJOs or schemas. If there is a way to achieve this with Beam, please do help.

Regards,
Akanksha
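
For reference, the Spark-style flow mentioned above (read Parquet into a DataFrame, create a temp view, apply SQL, write Parquet) could look roughly like the PySpark sketch below. This is only an illustration of the behaviour being asked about, not a Beam solution. The function name run_sql_jobs, the shape of the jobs config, the view name "source", the file paths, and the column names id and parent_id are all made up for the example.

```python
"""Sketch of the Spark-style flow: Parquet in, config-driven SQL, Parquet out."""


def run_sql_jobs(spark, source_path, jobs, view_name="source"):
    """Read one Parquet source, then write one Parquet output per SQL job.

    `jobs` is a hypothetical config: a list of {"sql": ..., "output": ...} dicts.
    No schema, POJO, or Avro definition is declared here; Spark reads the
    schema from the Parquet file metadata.
    """
    df = spark.read.parquet(source_path)
    # Expose the DataFrame to SQL under a view name the queries can reference.
    df.createOrReplaceTempView(view_name)

    written = []
    for job in jobs:
        result = spark.sql(job["sql"])
        result.write.mode("overwrite").parquet(job["output"])
        written.append(job["output"])
    return written


def main():
    # Imported here so run_sql_jobs can be exercised without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-sql-sketch").getOrCreate()
    # Hypothetical job list; in practice this would come from the SQL query
    # generator or configuration file. Column names are assumptions.
    jobs = [
        {
            "sql": "SELECT a.* FROM source a JOIN source b ON a.id = b.parent_id",
            "output": "/tmp/out/joined",
        },
        {
            "sql": "SELECT id, COUNT(*) AS n FROM source GROUP BY id",
            "output": "/tmp/out/counts",
        },
    ]
    run_sql_jobs(spark, "/tmp/in/source.parquet", jobs)
    spark.stop()
```

The point of the sketch is that the only inputs are a path and a list of SQL strings with output locations. The question is whether Beam offers an equivalent in which the schema never appears in user code.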
