Re: pipeline with parquet and sql
In terms of schema and the ParquetIO source/sink, there was an answer in a previous thread [1]:

Currently (without introducing any change in ParquetIO) there is no way to not pass the Avro schema. It will probably be replaced with Beam's schema in the future.

[1] https://lists.apache.org/thread.html/a466ddeb55e47fd780be3bcd8eec9d6b6eaf1dfd566ae5278b5fb9e8@%3Cuser.beam.apache.org%3E

On Tue, 31 Jul 2018 at 10:19, Akanksha Sharma B wrote:

> Hi,
>
> I am hoping to get some hints/pointers from the experts here.
>
> I hope the scenario described below was understandable, and that it is a
> valid use case. Please let me know if I need to explain the scenario
> better.
>
> Regards,
> Akanksha
>
> ----------
> *From:* Akanksha Sharma B
> *Sent:* Friday, July 27, 2018 9:44 AM
> *To:* dev@beam.apache.org
> *Subject:* Re: pipeline with parquet and sql
>
> Hi,
>
> Please consider the following pipeline:
>
> The source is a Parquet file with hundreds of columns. The sink is also
> Parquet: multiple output Parquet files are generated after applying SQL
> joins, and the joins to be applied differ for each output file. Let's
> assume we have a SQL query generator or a configuration file with the
> needed information.
>
> Can this be implemented generically, so that there is no need for the
> schema of the Parquet files involved, nor for any intermediate POJO or
> Beam schema? That is, the way Spark can handle it: read Parquet into a
> DataFrame, create a temp view, apply SQL queries to it, and write it
> back to Parquet.
>
> As I understand it, Beam SQL needs a Beam Schema or POJOs, and ParquetIO
> needs Avro schemas. Ideally we don't want to see POJOs or schemas at all.
> If there is a way to achieve this with Beam, please do help.
>
> Regards,
> Akanksha
Re: pipeline with parquet and sql
Sorry, I sent an unfinished message. In terms of schema and the ParquetIO source/sink, there was an answer in a previous thread [1]:

Currently (without introducing any change in ParquetIO) there is no way to not pass the Avro schema. It will probably be replaced with Beam's schema in the future [2].

[1] https://lists.apache.org/thread.html/a466ddeb55e47fd780be3bcd8eec9d6b6eaf1dfd566ae5278b5fb9e8@%3Cuser.beam.apache.org%3E
[2] https://issues.apache.org/jira/browse/BEAM-4812

On Tue, 31 Jul 2018 at 12:43, Łukasz Gajowy wrote:

> In terms of schema and the ParquetIO source/sink, there was an answer in
> a previous thread: Currently (without introducing any change in
> ParquetIO) there is no way to not pass the Avro schema. It will probably
> be replaced with Beam's schema in the future.
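For concreteness, here is a minimal sketch of what that schema requirement looks like with ParquetIO in Beam's Java SDK. The file paths and the two-field Avro schema are made up for illustration; the point is that the same Avro schema must be handed explicitly to both the read and the write side.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.values.PCollection;

public class ParquetSchemaExample {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // The Avro schema has to be supplied up front; ParquetIO does not
    // infer it from the files themselves. (Hypothetical two-field schema.)
    Schema schema = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"MyRecord\", \"fields\": ["
            + "{\"name\": \"id\", \"type\": \"long\"},"
            + "{\"name\": \"name\", \"type\": \"string\"}]}");

    // Read: the schema is passed explicitly to the source.
    PCollection<GenericRecord> records =
        pipeline.apply(ParquetIO.read(schema).from("/path/to/input/*.parquet"));

    // Write: ParquetIO.sink(...) needs the schema again.
    records.apply(
        FileIO.<GenericRecord>write()
            .via(ParquetIO.sink(schema))
            .to("/path/to/output/"));

    pipeline.run().waitUntilFinish();
  }
}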
Re: pipeline with parquet and sql
On Wed, Aug 1, 2018 at 1:12 AM Akanksha Sharma B <akanksha.b.sha...@ericsson.com> wrote:

> Hi,
>
> Thanks. I understood the Parquet point. I will wait a couple of days on
> this topic. Even if this scenario cannot be achieved now, any design
> document or future plans in this direction would also be helpful to me.
>
> To summarize, since I do not understand Beam well enough, can someone
> please comment on whether the following fits Beam's model and future
> direction:
>
> "Read Parquet (along with its inferred schema) into something like a
> DataFrame or Beam Rows, and vice versa for the write, i.e. take Rows and
> write Parquet based on the Rows' schema."

Beam currently does not have a standard message format. A Beam pipeline consists of PCollections and transforms (which convert PCollections into other PCollections). You can transform the PCollection read from Parquet using a ParDo and write the resulting PCollection back to Parquet format. I think schema-aware PCollections [1] might be close to what you need, but I am not sure whether they fulfill your exact requirement.

Thanks,
Cham

[1] https://lists.apache.org/thread.html/fe327866c6c81b7e55af28f81cedd9b2e588279def330940e8b8ebd7@%3Cdev.beam.apache.org%3E
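To illustrate the ParDo suggestion above, here is a rough sketch of bridging ParquetIO's GenericRecords into Beam SQL by hand, continuing from the PCollection<GenericRecord> named records in the earlier sketch. The Beam schema, field names, and query are hypothetical, and the hand-written conversion is exactly the boilerplate the original question hoped to avoid; schema-aware PCollections aim to remove it. SqlTransform and setRowSchema are the names in recent Beam releases (older releases used BeamSql.query and an explicit RowCoder).

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

// A Beam schema mirroring the Avro schema, written by hand for this sketch.
final Schema rowSchema =
    Schema.builder().addInt64Field("id").addStringField("name").build();

// ParDo converting each GenericRecord into a Beam Row.
PCollection<Row> rows =
    records
        .apply("AvroToRow", ParDo.of(new DoFn<GenericRecord, Row>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            GenericRecord r = c.element();
            c.output(Row.withSchema(rowSchema)
                .addValues((Long) r.get("id"), r.get("name").toString())
                .build());
          }
        }))
        .setRowSchema(rowSchema);

// With Rows in hand, Beam SQL can query the collection; a single input
// PCollection is registered under the table name PCOLLECTION.
PCollection<Row> result =
    rows.apply(SqlTransform.query(
        "SELECT id, name FROM PCOLLECTION WHERE id > 100"));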