Hi,

Thanks, I understood the Parquet point. I will wait a couple of days on this
topic. Even if this scenario cannot be achieved now, any design document or
future plans in this direction would also be helpful to me.


To summarize: I do not understand Beam well enough, so could someone please
comment on whether the following fits Beam's model and future direction?

"read parquet (along with inferred schema) into something like dataframe or 
Beam Rows. And vice versa for write i.e. get rows and write parquet based on 
Row's schema."



Regards,

Akanksha


________________________________
From: Łukasz Gajowy <[email protected]>
Sent: Tuesday, July 31, 2018 12:43:32 PM
To: [email protected]
Cc: [email protected]
Subject: Re: pipeline with parquet and sql

In terms of schema and the ParquetIO source/sink, there was an answer in a
previous thread [1]:

Currently (without introducing any change in ParquetIO) there is no way to
avoid passing the Avro schema. It will probably be replaced with Beam's schema
in the future.

[1] 
https://lists.apache.org/thread.html/a466ddeb55e47fd780be3bcd8eec9d6b6eaf1dfd566ae5278b5fb9e8@%3Cuser.beam.apache.org%3E
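
Concretely, "passing the avro schema" looks roughly like the snippet below
(the example schema and paths are made-up placeholders): ParquetIO.read()
takes an org.apache.avro.Schema and produces GenericRecords, and
ParquetIO.sink() needs the same schema again on the write side.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.parquet.ParquetIO;
    import org.apache.beam.sdk.values.PCollection;

    Pipeline pipeline = Pipeline.create();

    // The Avro schema has to be supplied up front; it is not read from the files.
    Schema avroSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Example\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"name\",\"type\":\"string\"}]}");

    PCollection<GenericRecord> records =
        pipeline.apply(ParquetIO.read(avroSchema).from("/data/input/*.parquet"));

    records.apply(FileIO.<GenericRecord>write()
        .via(ParquetIO.sink(avroSchema))
        .to("/data/output/")
        .withSuffix(".parquet"));

    pipeline.run();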


On Tue, 31 Jul 2018 at 10:19 Akanksha Sharma B
<[email protected]> wrote:

Hi,


I am hoping to get some hints/pointers from the experts here.

I hope the scenario described below is understandable and a valid use case.
Please let me know if I need to explain it better.


Regards,

Akanksha

________________________________
From: Akanksha Sharma B
Sent: Friday, July 27, 2018 9:44 AM
To: [email protected]
Subject: Re: pipeline with parquet and sql


Hi,


Please consider the following pipeline:


The source is a Parquet file with hundreds of columns.

The sink is Parquet. Multiple output Parquet files are generated after applying
SQL joins, and the joins to be applied differ for each output Parquet file.
Let's assume we have a SQL query generator or a configuration file with the
needed info.


Can this be implemented generically, so that there is no need for the schemas
of the Parquet files involved, or for any intermediate POJO or Beam schema?

That is, the way Spark can handle it: read Parquet into a dataframe, create a
temp view, apply SQL queries to it, and write the result back to Parquet.
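
For comparison, the Spark version is just a few lines and never mentions a
schema (paths and the query are placeholders):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder().appName("parquet-sql").getOrCreate();

    // The schema comes from the Parquet files themselves; nothing is declared by hand.
    Dataset<Row> input = spark.read().parquet("/data/input/*.parquet");
    input.createOrReplaceTempView("input");

    // One of the configured queries; each output gets its own query and path.
    Dataset<Row> output = spark.sql("SELECT colA, colB FROM input WHERE colC > 0");
    output.write().parquet("/data/output/");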

As I understand it, Beam SQL needs a Beam Schema (or POJOs) and ParquetIO needs
Avro schemas. Ideally we don't want to deal with POJOs or schemas at all.
If there is a way to achieve this with Beam, please do help.
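
As far as I can tell, the pipeline would currently have to be wired roughly as
in the sketch below, with schemas at every edge, which is exactly what we would
like to avoid. The AvroUtils conversion helpers (toBeamSchema, toBeamRowStrict,
toAvroSchema, toGenericRecord) are my assumption and may not be available in
every Beam release; the schema file, query and paths are placeholders.

    import java.io.File;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.AvroCoder;
    import org.apache.beam.sdk.extensions.sql.SqlTransform;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.parquet.ParquetIO;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.schemas.utils.AvroUtils;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.Row;
    import org.apache.beam.sdk.values.TypeDescriptor;

    public class ParquetSqlPipeline {
      public static void main(String[] args) throws Exception {
        // Avro schema for the input files, loaded from an .avsc file (placeholder path).
        org.apache.avro.Schema avroSchema =
            new org.apache.avro.Schema.Parser().parse(new File("/path/to/input.avsc"));
        Schema beamSchema = AvroUtils.toBeamSchema(avroSchema);   // assumed helper

        Pipeline pipeline = Pipeline.create();

        // 1. Read Parquet as GenericRecords; the Avro schema must be passed explicitly.
        PCollection<GenericRecord> records =
            pipeline.apply("ReadParquet",
                ParquetIO.read(avroSchema).from("/data/input/*.parquet"));

        // 2. GenericRecord -> Row, so that Beam SQL can see named, typed columns.
        PCollection<Row> rows =
            records
                .apply("ToRows",
                    MapElements.into(TypeDescriptor.of(Row.class))
                        .via((GenericRecord r) -> AvroUtils.toBeamRowStrict(r, beamSchema)))  // assumed helper
                .setRowSchema(beamSchema);

        // 3. One of the generated/configured queries; PCOLLECTION is the default table name.
        PCollection<Row> joined =
            rows.apply("Query", SqlTransform.query("SELECT colA, colB FROM PCOLLECTION"));

        // 4. Row -> GenericRecord and write Parquet; an Avro schema is needed again.
        //    It is carried as a JSON string and re-parsed in the lambda to avoid
        //    relying on org.apache.avro.Schema being serializable.
        String outputSchemaJson = AvroUtils.toAvroSchema(joined.getSchema()).toString();  // assumed helper
        org.apache.avro.Schema outputAvroSchema =
            new org.apache.avro.Schema.Parser().parse(outputSchemaJson);

        joined
            .apply("ToGenericRecords",
                MapElements.into(TypeDescriptor.of(GenericRecord.class))
                    .via((Row row) -> AvroUtils.toGenericRecord(                               // assumed helper
                        row, new org.apache.avro.Schema.Parser().parse(outputSchemaJson))))
            .setCoder(AvroCoder.of(GenericRecord.class, outputAvroSchema))
            .apply("WriteParquet",
                FileIO.<GenericRecord>write()
                    .via(ParquetIO.sink(outputAvroSchema))
                    .to("/data/output/")
                    .withSuffix(".parquet"));

        pipeline.run().waitUntilFinish();
      }
    }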

Regards,
Akanksha


________________________________
From: Akanksha Sharma B
Sent: Tuesday, July 24, 2018 4:47:25 PM
To: [email protected]
Subject: pipeline with parquet and sql


Hi,


Please consider the following pipeline:


The source is a Parquet file with hundreds of columns.

The sink is Parquet. Multiple output Parquet files are generated after applying
SQL joins, and the joins to be applied differ for each output Parquet file.
Let's assume we have a SQL query generator or a configuration file with the
needed info.


Can this be implemented generically, so that there is no need for the schemas
of the Parquet files involved, or for any intermediate POJO or Beam schema?

That is, the way Spark can handle it: read Parquet into a dataframe, create a
temp view, apply SQL queries to it, and write the result back to Parquet.

As I understand it, Beam SQL needs a Beam Schema (or POJOs) and ParquetIO needs
Avro schemas. Ideally we don't want to deal with POJOs or schemas at all.
If there is a way to achieve this with Beam, please do help.

Regards,
Akanksha

